- The paper introduces a novel end-to-end network that regresses 6D poses by guiding direct regression with intermediate geometric features.
- It leverages continuous 6D rotation and scale-invariant translation parameterizations to overcome the limitations of traditional PnP/RANSAC methods.
- The Patch-PnP module efficiently utilizes dense geometric maps to achieve real-time, robust performance on standard evaluation benchmarks.
This paper introduces GDR-Net (Geometry-guided Direct Regression Network), a novel approach for monocular 6D object pose estimation (predicting 3D rotation and translation from a single RGB image) that aims to combine the strengths of two dominant strategies: indirect (geometry-based) and direct (regression-based) methods.
Traditional indirect methods first establish 2D-3D correspondences between image pixels and points on the object's 3D model, then use algorithms like PnP (Perspective-n-Point) within a RANSAC scheme to solve for the pose. While accurate, these methods are typically not end-to-end differentiable due to the non-differentiable PnP/RANSAC step, limiting their use in tasks requiring differentiable poses (e.g., self-supervised learning) and potentially hindering optimal training as the correspondence loss is only a surrogate for the final pose error. Direct methods regress the 6D pose directly from image features, making them end-to-end trainable but generally less accurate than indirect methods.
GDR-Net proposes an end-to-end differentiable network that directly regresses the 6D pose but guides this regression using intermediate geometric representations similar to those used in indirect methods.
Key Contributions and Methodology:
- Revisiting Direct Regression: The authors first analyze direct regression methods and identify key factors for improving their performance:
- Rotation Parameterization: Using a continuous 6-dimensional representation (R6d) derived from the first two columns of the rotation matrix, specifically in its allocentric form (viewpoint-invariant under translation), significantly outperforms discontinuous representations like quaternions or Lie algebra vectors.
- Translation Parameterization: Employing a Scale-Invariant Translation Estimation (SITE) representation (tSITE), which normalizes the projected 2D center offset (δx,δy) by the bounding box dimensions and the depth estimate (δz) by the zoom-in ratio, proves more effective for handling zoomed-in Regions of Interest (RoIs) compared to directly regressing 3D translation or (ox,oy,tz).
- Disentangled Loss: A loss function (LPose) that separately penalizes errors in rotation (LR, based on the average L1 distance between transformed model points), normalized 2D center (Lcenter), and normalized depth (Lz) yields better results than coupled Point-Matching losses or simple angular/L1 losses on pose parameters. A symmetry-aware version (LR,sym) is used for symmetric objects.
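The continuous 6D rotation representation above maps two (possibly noisy) 3-vectors to a valid rotation matrix via Gram-Schmidt orthogonalization. A minimal NumPy sketch of that mapping follows; the function name is illustrative, not taken from the paper's code, and the allocentric-to-egocentric conversion is omitted.

```python
import numpy as np

def r6d_to_rotmat(r6d):
    """Map a continuous 6D rotation representation (the first two
    columns of a rotation matrix, possibly unnormalized) to a valid
    3x3 rotation matrix via Gram-Schmidt orthogonalization."""
    a1, a2 = r6d[:3], r6d[3:]
    b1 = a1 / np.linalg.norm(a1)            # normalize the first column
    a2_orth = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)  # normalize the second column
    b3 = np.cross(b1, b2)                   # third column completes a right-handed R
    return np.stack([b1, b2, b3], axis=-1)
```

Because every (non-degenerate) 6D vector maps to a valid rotation, the network can regress it with plain L1/L2 losses without hitting the discontinuities that quaternions or axis-angle representations suffer from.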
- Geometry-Guided Direct Regression Network (GDR-Net):
- Architecture: Inspired by the state-of-the-art indirect method CDPN (1908.07363), GDR-Net takes a zoomed-in RoI as input and predicts intermediate geometric feature maps:
- Dense Correspondences Map (M2D-3D): Encodes dense 2D-3D correspondences by predicting normalized 3D object coordinates (MXYZ) for visible pixels and stacking them with their 2D pixel coordinates. This provides rich geometric shape information in an image-like structure.
- Surface Region Attention Map (MSRA): Predicts the probability distribution over predefined surface regions (derived via farthest point sampling on the 3D model) for each pixel. This acts as an ambiguity-aware attention mechanism, especially helpful for symmetric objects, and serves as an auxiliary task to aid MXYZ learning.
- Visible Object Mask (Mvis).
- Patch-PnP Module: Instead of using traditional PnP/RANSAC, GDR-Net introduces a simple, learnable 2D convolutional module ("Patch-PnP") that takes the M2D-3D and MSRA maps as input and directly regresses the 6D pose (parameterized as Ra6d and tSITE). This module exploits the image-like spatial structure of the dense correspondence maps, which PointNet-based approaches neglect. It consists of several convolutional layers followed by fully connected layers.
- End-to-End Training: The entire network, including the backbone predicting geometric maps and the Patch-PnP module, is trained end-to-end. The total loss is LGDR=LPose+LGeom, where LGeom supervises the intermediate geometric maps (L1 loss for MXYZ and Mvis, Cross-Entropy for MSRA).
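As a rough sketch of the disentangled pose loss described above (not the authors' implementation; symmetry-aware handling and batching are omitted, and the SITE weighting is simplified):

```python
import numpy as np

def disentangled_pose_loss(R_pred, t_site_pred, R_gt, t_site_gt, pts):
    """Sketch of LPose = LR + Lcenter + Lz (symmetry handling omitted).
    R_*: 3x3 rotation matrices; t_site_*: (dx, dy, dz) SITE translations;
    pts: (N, 3) points sampled on the object's 3D model."""
    # LR: average L1 distance between model points transformed by each rotation
    l_rot = np.abs(pts @ R_pred.T - pts @ R_gt.T).mean()
    # Lcenter: L1 on the normalized 2D projection offset
    l_center = np.abs(t_site_pred[:2] - t_site_gt[:2]).sum()
    # Lz: L1 on the normalized depth
    l_z = np.abs(t_site_pred[2] - t_site_gt[2])
    return l_rot + l_center + l_z
```

Penalizing rotation, 2D center, and depth separately keeps gradient magnitudes from one term from dominating the others, which the paper finds works better than a coupled Point-Matching loss.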
Implementation Details:
- Input: Zoomed-in RoIs (e.g., 256×256) obtained from an off-the-shelf 2D object detector (e.g., Faster R-CNN, FCOS). Dynamic Zoom-In (DZI) augmentation is used during training to decouple detection and pose estimation training.
- Network: Based on the CDPN architecture for feature extraction, modified to output MXYZ, MSRA, and Mvis. Patch-PnP uses 3×3 convolutions with stride 2, Group Normalization, and ReLU, followed by FC layers.
- Training: Uses Ranger optimizer with a cosine learning rate schedule. Trained end-to-end without complex multi-stage strategies.
- Pose Conversion: The predicted Ra6d is converted to a 3×3 rotation matrix R by Gram-Schmidt orthogonalization of its two predicted column vectors (with the third column obtained as their cross product). The predicted tSITE is converted to the 3D translation t using the bounding box parameters and camera intrinsics K: the normalized offsets are mapped back to the projected 2D object center and absolute depth, which are then back-projected through K.
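The tSITE-to-translation conversion can be sketched as follows. The exact normalization conventions here (offsets scaled by bbox width/height, depth divided by the zoom ratio) are illustrative assumptions consistent with the summary above; check the paper for the precise definitions.

```python
import numpy as np

def site_to_translation(t_site, bbox, zoom_ratio, K):
    """Recover the 3D translation t from a SITE parameterization.
    Conventions assumed here (illustrative):
      t_site = (dx, dy, dz), with dx, dy the 2D center offset normalized
      by the bbox width/height and dz the depth divided by zoom_ratio.
    bbox = (cx, cy, w, h): detected box center and size in pixels.
    K: 3x3 camera intrinsics matrix."""
    dx, dy, dz = t_site
    cx, cy, w, h = bbox
    ox = dx * w + cx         # projected object center x (pixels)
    oy = dy * h + cy         # projected object center y (pixels)
    tz = dz * zoom_ratio     # undo the scale-invariant depth normalization
    # Back-project the 2D center at depth tz using the pinhole model
    uv1 = np.array([ox, oy, 1.0])
    return tz * (np.linalg.inv(K) @ uv1)
```

Because the network only sees the zoomed-in RoI, predicting offsets relative to the box rather than absolute image coordinates keeps the regression target invariant to where and how large the detection was.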
Experiments and Results:
- Synthetic Sphere: Patch-PnP demonstrates superior robustness to noise and outliers compared to RANSAC-based EPnP and the learning-based PnP from Hu et al. (2003.13276).
- Ablation Study (LM dataset): Confirms the effectiveness of the chosen pose parameterizations (Ra6d, tSITE), the disentangled loss LPose, the Patch-PnP module (outperforming alternatives like PointNet-like PnP and BPnP (2003.13683)), and the geometric guidance (especially M2D-3D and MSRA). Notably, GDR-Net trained only with LPose (no explicit geometric map supervision) already performs competitively, highlighting the benefit of proper parameterization and loss.
- State-of-the-Art Comparison (LM, LM-O, YCB-V): GDR-Net achieves state-of-the-art results on these benchmarks, significantly outperforming previous direct and indirect methods without refinement steps. It even surpasses some refinement-based methods like DeepIM (1808.09548) and is competitive with CosyPose (2008.08465), while being faster. Results are strong both under standard protocols and the BOP benchmark (1808.09467) protocol.
- Runtime: Achieves real-time performance (e.g., ~35ms for 8 objects on a 2080Ti GPU, including detection).
Practical Implications:
GDR-Net provides a practical and effective way to perform monocular 6D object pose estimation.
- End-to-End Differentiability: Makes it suitable for integration into larger systems or for tasks like self-supervised learning where gradients through the pose estimator are needed.
- High Accuracy: Achieves state-of-the-art accuracy without requiring computationally expensive refinement steps or complex PnP/RANSAC solvers during inference.
- Real-time Speed: Efficient enough for real-world applications like robotics and AR.
- Implementation: The architecture builds upon existing structures (like CDPN) and introduces a relatively simple Patch-PnP module. The key implementation choices (pose parameterization, loss function) are clearly outlined. The use of Dynamic Zoom-In simplifies training by decoupling it from specific object detectors.
- Robustness: The learned Patch-PnP module shows better robustness to noise in the intermediate representations compared to traditional methods.
In summary, GDR-Net presents a unified framework that leverages geometric insights from indirect methods to guide a direct regression network, resulting in an accurate, fast, and end-to-end differentiable solution for 6D object pose estimation. The paper demonstrates the importance of careful parameterization and loss design in direct regression and introduces the effective Patch-PnP module for learning the pose estimation step from dense, structured geometric features.