
GDR-Net: Geometry-Guided Direct Regression Network for Monocular 6D Object Pose Estimation (2102.12145v3)

Published 24 Feb 2021 in cs.CV and cs.RO

Abstract: 6D pose estimation from a single RGB image is a fundamental task in computer vision. The current top-performing deep learning-based methods rely on an indirect strategy, i.e., first establishing 2D-3D correspondences between the coordinates in the image plane and object coordinate system, and then applying a variant of the PnP/RANSAC algorithm. However, this two-stage pipeline is not end-to-end trainable, thus is hard to be employed for many tasks requiring differentiable poses. On the other hand, methods based on direct regression are currently inferior to geometry-based methods. In this work, we perform an in-depth investigation on both direct and indirect methods, and propose a simple yet effective Geometry-guided Direct Regression Network (GDR-Net) to learn the 6D pose in an end-to-end manner from dense correspondence-based intermediate geometric representations. Extensive experiments show that our approach remarkably outperforms state-of-the-art methods on LM, LM-O and YCB-V datasets. Code is available at https://git.io/GDR-Net.

Summary

  • The paper introduces a novel end-to-end network that regresses 6D poses by guiding direct regression with intermediate geometric features.
  • It leverages continuous 6D rotation and scale-invariant translation parameterizations to overcome the limitations of traditional PnP/RANSAC methods.
  • The Patch-PnP module efficiently utilizes dense geometric maps to achieve real-time, robust performance on standard evaluation benchmarks.

This paper introduces GDR-Net (Geometry-guided Direct Regression Network), a novel approach for monocular 6D object pose estimation (predicting 3D rotation and translation from a single RGB image) that aims to combine the strengths of two dominant strategies: indirect (geometry-based) and direct (regression-based) methods.

Traditional indirect methods first establish 2D-3D correspondences between image pixels and points on the object's 3D model, then use algorithms like PnP (Perspective-n-Point) within a RANSAC scheme to solve for the pose. While accurate, these methods are typically not end-to-end differentiable due to the non-differentiable PnP/RANSAC step, limiting their use in tasks requiring differentiable poses (e.g., self-supervised learning) and potentially hindering optimal training as the correspondence loss is only a surrogate for the final pose error. Direct methods regress the 6D pose directly from image features, making them end-to-end trainable but generally less accurate than indirect methods.
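
To make the two-stage indirect pipeline concrete, here is a minimal numpy sketch (illustrative only, not the paper's implementation): dense 2D-3D correspondences are formed by projecting model points with a known pose, and the pose is then recovered by a toy DLT-based PnP solver standing in for EPnP. The robust hypothesize-and-verify loop of RANSAC, omitted here, is the part that makes the full pipeline hard to differentiate through.

```python
import numpy as np

def project(K, R, t, pts3d):
    """Project 3D model points into the image: the 2D-3D correspondences
    that indirect methods predict per pixel."""
    cam = pts3d @ R.T + t          # (N, 3) points in the camera frame
    uv = cam @ K.T                 # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]

def dlt_pnp(K, pts2d, pts3d):
    """Toy PnP via Direct Linear Transform: estimate the 3x4 matrix [R|t]
    from >= 6 correspondences in normalized camera coordinates, then fix
    scale/sign and re-orthonormalize R via SVD. A stand-in for EPnP/RANSAC,
    for illustration only."""
    # normalize pixels to rays: x = K^-1 [u, v, 1]^T
    rays = np.linalg.solve(K, np.c_[pts2d, np.ones(len(pts2d))].T).T
    A = []
    for (x, y), X in zip(rays[:, :2], pts3d):
        Xh = np.r_[X, 1.0]
        A.append(np.r_[Xh, np.zeros(4), -x * Xh])   # x * (r3.Xh) = r1.Xh
        A.append(np.r_[np.zeros(4), Xh, -y * Xh])   # y * (r3.Xh) = r2.Xh
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)                        # null vector -> [R|t]
    P /= np.mean(np.linalg.norm(P[:, :3], axis=1))  # rows of R have norm 1
    if np.linalg.det(P[:, :3]) < 0:                 # resolve the sign
        P = -P
    U, _, Vt2 = np.linalg.svd(P[:, :3])             # nearest rotation
    return U @ Vt2, P[:, 3]
```

With exact correspondences this recovers the pose to numerical precision; real pipelines wrap such a solver in RANSAC to survive outliers in the predicted correspondences.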

GDR-Net proposes an end-to-end differentiable network that directly regresses the 6D pose but guides this regression using intermediate geometric representations similar to those used in indirect methods.

Key Contributions and Methodology:

  1. Revisiting Direct Regression: The authors first analyze direct regression methods and identify key factors for improving their performance:
    • Rotation Parameterization: Using a continuous 6-dimensional representation R_6d, formed from the first two columns of the rotation matrix, significantly outperforms discontinuous representations such as quaternions or Lie-algebra vectors; the paper uses its allocentric form R_a6d, which is viewpoint-invariant under translation.
    • Translation Parameterization: Employing a Scale-Invariant Translation Estimation (SITE) representation t_SITE, which normalizes the projected 2D center offset (δ_x, δ_y) by the bounding box dimensions and the depth estimate δ_z by the zoom-in ratio, proves more effective for handling zoomed-in Regions of Interest (RoIs) than directly regressing the 3D translation or (o_x, o_y, t_z).
    • Disentangled Loss: A loss function L_Pose that separately penalizes errors in rotation (L_R, the average L1 distance between model points transformed by the predicted and ground-truth rotations), normalized 2D center (L_center), and normalized depth (L_z) yields better results than coupled Point-Matching losses or simple angular/L1 losses on the pose parameters. A symmetry-aware variant L_R,sym is used for symmetric objects.
  2. Geometry-Guided Direct Regression Network (GDR-Net):
    • Architecture: Inspired by the state-of-the-art indirect method CDPN (1908.07363), GDR-Net takes a zoomed-in RoI as input and predicts intermediate geometric feature maps:
      • Dense Correspondences Map (M_2D-3D): Encodes dense 2D-3D correspondences by predicting normalized 3D object coordinates (M_XYZ) for visible pixels and stacking them with their 2D pixel coordinates. This provides rich geometric shape information in an image-like structure.
      • Surface Region Attention Map (M_SRA): Predicts, for each pixel, a probability distribution over predefined surface regions (derived via farthest point sampling on the 3D model). This acts as an ambiguity-aware attention mechanism, especially helpful for symmetric objects, and serves as an auxiliary task that aids M_XYZ learning.
      • Visible Object Mask (M_vis).
    • Patch-PnP Module: Instead of using traditional PnP/RANSAC, GDR-Net introduces a simple, learnable 2D convolutional module ("Patch-PnP") that takes the M_2D-3D and M_SRA maps as input and directly regresses the 6D pose (parameterized as R_a6d and t_SITE). This module exploits the image-like spatial structure of the dense correspondence maps, which PointNet-based approaches neglect. It consists of several convolutional layers followed by fully connected layers.
    • End-to-End Training: The entire network, including the backbone predicting the geometric maps and the Patch-PnP module, is trained end-to-end. The total loss is L_GDR = L_Pose + L_Geom, where L_Geom supervises the intermediate geometric maps (L1 loss for M_XYZ and M_vis, cross-entropy for M_SRA).
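
The disentangled pose loss described above can be sketched as follows. This is a simplified, illustrative numpy version: the symmetry-aware variant and any loss weighting are omitted, and function and variable names are my own, not the authors'.

```python
import numpy as np

def disentangled_pose_loss(R_pred, t_site_pred, R_gt, t_site_gt, model_pts):
    """Sketch of L_Pose = L_R + L_center + L_z.
    L_R: average L1 distance between model points rotated by the predicted
    vs. ground-truth rotation; L_center and L_z: L1 errors on the SITE
    components (normalized 2D center offset and normalized depth)."""
    L_R = np.mean(np.abs(model_pts @ R_pred.T - model_pts @ R_gt.T))
    L_center = np.mean(np.abs(t_site_pred[:2] - t_site_gt[:2]))
    L_z = np.abs(t_site_pred[2] - t_site_gt[2])
    return L_R + L_center + L_z
```

Because each term penalizes one component of the pose in its own normalized units, a large depth error cannot mask a small rotation error (and vice versa), which is the motivation for disentangling over a coupled point-matching loss.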

Implementation Details:

  • Input: Zoomed-in RoIs (e.g., 256×256) obtained from an off-the-shelf 2D object detector (e.g., Faster R-CNN, FCOS). Dynamic Zoom-In (DZI) augmentation is used during training to decouple detection from pose estimation.
  • Network: Based on the CDPN architecture for feature extraction, modified to output M_XYZ, M_SRA, and M_vis. Patch-PnP uses 3×3 convolutions with stride 2, Group Normalization, and ReLU, followed by fully connected layers.
  • Training: Uses the Ranger optimizer with a cosine learning rate schedule; trained end-to-end without complex multi-stage strategies.
  • Pose Conversion: The predicted R_a6d is converted to a 3×3 rotation matrix R by Gram-Schmidt orthogonalization of its two predicted column vectors; the predicted t_SITE is back-projected to the 3D translation t using the bounding box parameters and the camera intrinsics K.
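
The two conversions can be sketched in numpy as follows. This is illustrative: the SITE back-projection follows the normalization convention described above (center offset scaled by box size, depth by zoom-in ratio), and the variable names are assumptions, not the authors' code.

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Gram-Schmidt the two predicted 3-vectors into an orthonormal
    rotation matrix (the standard continuous 6D parameterization)."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - (b1 @ a2) * b1          # remove the component along b1
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)             # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)   # b1, b2, b3 as columns

def site_to_translation(t_site, box_cxcywh, zoom_ratio, K):
    """Back-project the scale-invariant translation estimate to a metric
    3D translation t = t_z * K^-1 [o_x, o_y, 1]^T."""
    dx, dy, dz = t_site
    cx, cy, w, h = box_cxcywh
    ox, oy = dx * w + cx, dy * h + cy   # projected object center (pixels)
    tz = dz * zoom_ratio                # undo the zoom normalization
    return tz * np.linalg.solve(K, np.array([ox, oy, 1.0]))
```

Both maps are smooth in the network outputs, which is what keeps the pose head end-to-end differentiable.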

Experiments and Results:

  • Synthetic Sphere: Patch-PnP demonstrates superior robustness to noise and outliers compared to RANSAC-based EPnP and the learning-based PnP from Hu et al. (2003.13276).
  • Ablation Study (LM dataset): Confirms the effectiveness of the chosen pose parameterizations (R_a6d, t_SITE), the disentangled loss L_Pose, the Patch-PnP module (outperforming alternatives such as a PointNet-like PnP and BPnP (2003.13683)), and the geometric guidance (especially M_2D-3D and M_SRA). Notably, GDR-Net trained only with L_Pose (no explicit supervision of the geometric maps) already performs competitively, highlighting the benefit of proper parameterization and loss design.
  • State-of-the-Art Comparison (LM, LM-O, YCB-V): GDR-Net achieves state-of-the-art results on these benchmarks, significantly outperforming previous direct and indirect methods without refinement steps. It even surpasses some refinement-based methods like DeepIM (1808.09548) and is competitive with CosyPose (2008.08465), while being faster. Results are strong both under standard protocols and the BOP benchmark (1808.09467) protocol.
  • Runtime: Achieves real-time performance (e.g., ~35ms for 8 objects on a 2080Ti GPU, including detection).

Practical Implications:

GDR-Net provides a practical and effective way to perform monocular 6D object pose estimation.

  • End-to-End Differentiability: Makes it suitable for integration into larger systems or for tasks like self-supervised learning where gradients through the pose estimator are needed.
  • High Accuracy: Achieves state-of-the-art accuracy without requiring computationally expensive refinement steps or complex PnP/RANSAC solvers during inference.
  • Real-time Speed: Efficient enough for real-world applications like robotics and AR.
  • Implementation: The architecture builds upon existing structures (like CDPN) and introduces a relatively simple Patch-PnP module. The key implementation choices (pose parameterization, loss function) are clearly outlined. The use of Dynamic Zoom-In simplifies training by decoupling it from specific object detectors.
  • Robustness: The learned Patch-PnP module shows better robustness to noise in the intermediate representations compared to traditional methods.

In summary, GDR-Net presents a unified framework that leverages geometric insights from indirect methods to guide a direct regression network, resulting in an accurate, fast, and end-to-end differentiable solution for 6D object pose estimation. The paper demonstrates the importance of careful parameterization and loss design in direct regression and introduces the effective Patch-PnP module for learning the pose estimation step from dense, structured geometric features.