- The paper introduces a novel end-to-end network that regresses 6D poses by guiding direct regression with intermediate geometric features.
- It leverages continuous 6D rotation and scale-invariant translation parameterizations to overcome the limitations of traditional PnP/RANSAC methods.
- The Patch-PnP module efficiently utilizes dense geometric maps to achieve real-time, robust performance on standard evaluation benchmarks.
This paper introduces GDR-Net (Geometry-guided Direct Regression Network), a novel approach for monocular 6D object pose estimation (predicting 3D rotation and translation from a single RGB image) that aims to combine the strengths of two dominant strategies: indirect (geometry-based) and direct (regression-based) methods.
Traditional indirect methods first establish 2D-3D correspondences between image pixels and points on the object's 3D model, then use algorithms like PnP (Perspective-n-Point) within a RANSAC scheme to solve for the pose. While accurate, these methods are typically not end-to-end differentiable due to the non-differentiable PnP/RANSAC step, limiting their use in tasks requiring differentiable poses (e.g., self-supervised learning) and potentially hindering optimal training as the correspondence loss is only a surrogate for the final pose error. Direct methods regress the 6D pose directly from image features, making them end-to-end trainable but generally less accurate than indirect methods.
GDR-Net proposes an end-to-end differentiable network that directly regresses the 6D pose but guides this regression using intermediate geometric representations similar to those used in indirect methods.
Key Contributions and Methodology:
- Revisiting Direct Regression: The authors first analyze direct regression methods and identify key factors for improving their performance:
- Rotation Parameterization: Using a continuous 6-dimensional representation (R6d) derived from the first two columns of the rotation matrix, specifically in its allocentric form (viewpoint-invariant under translation), significantly outperforms discontinuous representations like quaternions or Lie algebra vectors.
- Translation Parameterization: Employing a Scale-Invariant Translation Estimation (SITE) representation (tSITE), which normalizes the projected 2D center offset (δx,δy) by the bounding box dimensions and the depth estimate (δz) by the zoom-in ratio, proves more effective for handling zoomed-in Regions of Interest (RoIs) compared to directly regressing 3D translation or (ox,oy,tz).
- Disentangled Loss: A loss function (LPose) that separately penalizes errors in rotation (LR, based on the average L1 distance between transformed model points), normalized 2D center (Lcenter), and normalized depth (Lz) yields better results than coupled Point-Matching losses or simple angular/L1 losses on pose parameters. A symmetry-aware version (LR,sym) is used for symmetric objects.
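The continuous 6D rotation representation above maps two (possibly noisy) 3-vectors to a valid rotation matrix via Gram-Schmidt orthogonalization. A minimal NumPy sketch of that mapping follows; the function name is illustrative, not taken from the paper's code, and the allocentric-to-egocentric conversion is omitted.

```python
import numpy as np

def r6d_to_rotmat(r6d):
    """Map a continuous 6D rotation representation (the first two
    columns of a rotation matrix, possibly unnormalized) to a valid
    3x3 rotation matrix via Gram-Schmidt orthogonalization."""
    a1, a2 = r6d[:3], r6d[3:]
    b1 = a1 / np.linalg.norm(a1)            # normalize the first column
    a2_orth = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)  # normalize the second column
    b3 = np.cross(b1, b2)                   # third column completes a right-handed R
    return np.stack([b1, b2, b3], axis=-1)
```

Because every (non-degenerate) 6D vector maps to a valid rotation, the network can regress it with plain L1/L2 losses without hitting the discontinuities that quaternions or axis-angle representations suffer from.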
- Geometry-Guided Direct Regression Network (GDR-Net):
- Architecture: Inspired by the state-of-the-art indirect method CDPN (1908.07363), GDR-Net takes a zoomed-in RoI as input and predicts intermediate geometric feature maps:
- Dense Correspondences Map (M2D-3D): Encodes dense 2D-3D correspondences by predicting normalized 3D object coordinates (MXYZ) for visible pixels and stacking them with their 2D pixel coordinates. This provides rich geometric shape information in an image-like structure.
- Surface Region Attention Map (MSRA): Predicts the probability distribution over predefined surface regions (derived via farthest point sampling on the 3D model) for each pixel. This acts as an ambiguity-aware attention mechanism, especially helpful for symmetric objects, and serves as an auxiliary task to aid MXYZ learning.
- Visible Object Mask (Mvis).
- Patch-PnP Module: Instead of using traditional PnP/RANSAC, GDR-Net introduces a simple, learnable 2D convolutional module ("Patch-PnP") that takes the M2D-3D and MSRA maps as input and directly regresses the 6D pose (parameterized as Ra6d and tSITE). This module exploits the image-like spatial structure of the dense correspondence maps, which PointNet-based approaches neglect. It consists of several convolutional layers followed by fully connected layers.
- End-to-End Training: The entire network, including the backbone predicting geometric maps and the Patch-PnP module, is trained end-to-end. The total loss is LGDR=LPose+LGeom, where LGeom supervises the intermediate geometric maps (L1 loss for MXYZ and Mvis, Cross-Entropy for MSRA).
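As a rough sketch of the disentangled pose loss described above (not the authors' implementation; symmetry-aware handling and batching are omitted, and the SITE weighting is simplified):

```python
import numpy as np

def disentangled_pose_loss(R_pred, t_site_pred, R_gt, t_site_gt, pts):
    """Sketch of LPose = LR + Lcenter + Lz (symmetry handling omitted).
    R_*: 3x3 rotation matrices; t_site_*: (dx, dy, dz) SITE translations;
    pts: (N, 3) points sampled on the object's 3D model."""
    # LR: average L1 distance between model points transformed by each rotation
    l_rot = np.abs(pts @ R_pred.T - pts @ R_gt.T).mean()
    # Lcenter: L1 on the normalized 2D projection offset
    l_center = np.abs(t_site_pred[:2] - t_site_gt[:2]).sum()
    # Lz: L1 on the normalized depth
    l_z = np.abs(t_site_pred[2] - t_site_gt[2])
    return l_rot + l_center + l_z
```

Penalizing rotation, 2D center, and depth separately keeps gradient magnitudes from one term from dominating the others, which the paper finds works better than a coupled Point-Matching loss.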
Implementation Details:
- Input: Zoomed-in RoIs (e.g., 256×256) obtained from an off-the-shelf 2D object detector (e.g., Faster R-CNN, FCOS). Dynamic Zoom-In (DZI) augmentation is used during training to decouple detection and pose estimation training.
- Network: Based on the CDPN architecture for feature extraction, modified to output MXYZ, MSRA, and Mvis. Patch-PnP uses 3×3 convolutions with stride 2, Group Normalization, and ReLU, followed by FC layers.
- Training: Uses Ranger optimizer with a cosine learning rate schedule. Trained end-to-end without complex multi-stage strategies.
- Pose Conversion: The predicted Ra6d is converted to a 3×3 rotation matrix R by Gram-Schmidt orthogonalization of its two predicted column vectors (with the third column obtained as their cross product). The predicted tSITE is converted to the 3D translation t using the bounding box parameters and camera intrinsics K: the normalized offsets are mapped back to the projected 2D object center and absolute depth, which are then back-projected through K.
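The tSITE-to-translation conversion can be sketched as follows. The exact normalization conventions here (offsets scaled by bbox width/height, depth divided by the zoom ratio) are illustrative assumptions consistent with the summary above; check the paper for the precise definitions.

```python
import numpy as np

def site_to_translation(t_site, bbox, zoom_ratio, K):
    """Recover the 3D translation t from a SITE parameterization.
    Conventions assumed here (illustrative):
      t_site = (dx, dy, dz), with dx, dy the 2D center offset normalized
      by the bbox width/height and dz the depth divided by zoom_ratio.
    bbox = (cx, cy, w, h): detected box center and size in pixels.
    K: 3x3 camera intrinsics matrix."""
    dx, dy, dz = t_site
    cx, cy, w, h = bbox
    ox = dx * w + cx         # projected object center x (pixels)
    oy = dy * h + cy         # projected object center y (pixels)
    tz = dz * zoom_ratio     # undo the scale-invariant depth normalization
    # Back-project the 2D center at depth tz using the pinhole model
    uv1 = np.array([ox, oy, 1.0])
    return tz * (np.linalg.inv(K) @ uv1)
```

Because the network only sees the zoomed-in RoI, predicting offsets relative to the box rather than absolute image coordinates keeps the regression target invariant to where and how large the detection was.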
Experiments and Results:
- Synthetic Sphere: Patch-PnP demonstrates superior robustness to noise and outliers compared to RANSAC-based EPnP and the learning-based PnP from Hu et al. (2003.13276).
- Ablation Study (LM dataset): Confirms the effectiveness of the chosen pose parameterizations (Ra6d, tSITE), the disentangled loss LPose, the Patch-PnP module (outperforming alternatives like PointNet-like PnP and BPnP (2003.13683)), and the geometric guidance (especially M2D-3D and MSRA). Notably, GDR-Net trained only with LPose (no explicit geometric map supervision) already performs competitively, highlighting the benefit of proper parameterization and loss.
- State-of-the-Art Comparison (LM, LM-O, YCB-V): GDR-Net achieves state-of-the-art results on these benchmarks, significantly outperforming previous direct and indirect methods without refinement steps. It even surpasses some refinement-based methods like DeepIM (1808.09548) and is competitive with CosyPose (2008.08465), while being faster. Results are strong both under standard protocols and the BOP benchmark (1808.09467) protocol.
- Runtime: Achieves real-time performance (e.g., ~35ms for 8 objects on a 2080Ti GPU, including detection).
Practical Implications:
GDR-Net provides a practical and effective way to perform monocular 6D object pose estimation.
- End-to-End Differentiability: Makes it suitable for integration into larger systems or for tasks like self-supervised learning where gradients through the pose estimator are needed.
- High Accuracy: Achieves state-of-the-art accuracy without requiring computationally expensive refinement steps or complex PnP/RANSAC solvers during inference.
- Real-time Speed: Efficient enough for real-world applications like robotics and AR.
- Implementation: The architecture builds upon existing structures (like CDPN) and introduces a relatively simple Patch-PnP module. The key implementation choices (pose parameterization, loss function) are clearly outlined. The use of Dynamic Zoom-In simplifies training by decoupling it from specific object detectors.
- Robustness: The learned Patch-PnP module shows better robustness to noise in the intermediate representations compared to traditional methods.
In summary, GDR-Net presents a unified framework that leverages geometric insights from indirect methods to guide a direct regression network, resulting in an accurate, fast, and end-to-end differentiable solution for 6D object pose estimation. The paper demonstrates the importance of careful parameterization and loss design in direct regression and introduces the effective Patch-PnP module for learning the pose estimation step from dense, structured geometric features.