GDRNPP: A Geometry-guided and Fully Learning-based Object Pose Estimator (2102.12145v5)

Published 24 Feb 2021 in cs.CV and cs.RO

Abstract: 6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning reveals the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. Given that direct pose regression networks currently exhibit suboptimal performance, most methods still resort to traditional techniques to varying degrees. For example, top-performing methods often adopt an indirect strategy by first establishing 2D-3D or 3D-3D correspondences followed by applying the RANSAC-based PnP or Kabsch algorithms, and further employing ICP for refinement. Despite the performance enhancement, the integration of traditional techniques makes the networks time-consuming and not end-to-end trainable. Orthogonal to them, this paper introduces a fully learning-based object pose estimator. In this work, we first perform an in-depth investigation of both direct and indirect methods and propose a simple yet effective Geometry-guided Direct Regression Network (GDRN) to learn the 6D pose from monocular images in an end-to-end manner. Afterwards, we introduce a geometry-guided pose refinement module, enhancing pose accuracy when extra depth data is available. Guided by the predicted coordinate map, we build an end-to-end differentiable architecture that establishes robust and accurate 3D-3D correspondences between the observed and rendered RGB-D images to refine the pose. Our enhanced pose estimation pipeline GDRNPP (GDRN Plus Plus) conquered the leaderboard of the BOP Challenge for two consecutive years, becoming the first to surpass all prior methods that relied on traditional techniques in both accuracy and speed. The code and models are available at https://github.com/shanice-l/gdrnpp_bop2022.

Citations (304)

Summary

  • The paper introduces a novel end-to-end network that regresses 6D poses by guiding direct regression with intermediate geometric features.
  • It leverages continuous 6D rotation and scale-invariant translation parameterizations to overcome the limitations of traditional PnP/RANSAC methods.
  • The Patch-PnP module efficiently utilizes dense geometric maps to achieve real-time, robust performance on standard evaluation benchmarks.

This paper introduces GDR-Net (Geometry-guided Direct Regression Network), a novel approach for monocular 6D object pose estimation (predicting 3D rotation and translation from a single RGB image) that aims to combine the strengths of two dominant strategies: indirect (geometry-based) and direct (regression-based) methods.

Traditional indirect methods first establish 2D-3D correspondences between image pixels and points on the object's 3D model, then use algorithms like PnP (Perspective-n-Point) within a RANSAC scheme to solve for the pose. While accurate, these methods are typically not end-to-end differentiable due to the non-differentiable PnP/RANSAC step, limiting their use in tasks requiring differentiable poses (e.g., self-supervised learning) and potentially hindering optimal training as the correspondence loss is only a surrogate for the final pose error. Direct methods regress the 6D pose directly from image features, making them end-to-end trainable but generally less accurate than indirect methods.
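The 2D-3D correspondences at the heart of indirect methods come from the pinhole camera model: a model point $X$ under pose $(R, t)$ projects to pixel $x \sim K(RX + t)$, and PnP inverts this mapping. A minimal NumPy sketch of the forward projection (illustrative only, not the paper's code):

```python
import numpy as np

def project(points_3d, R, t, K):
    """Project 3D model points into the image under pose (R, t)
    with intrinsics K: x ~ K (R X + t)."""
    cam = points_3d @ R.T + t          # model frame -> camera frame
    uv = cam @ K.T                     # apply camera intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective division

# A point on the optical axis 2 m away lands at the principal point.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
project(np.array([[0.0, 0.0, 0.0]]), np.eye(3), np.array([0.0, 0.0, 2.0]), K)
```

PnP solvers recover $(R, t)$ from enough such pixel-to-model-point pairs; the RANSAC wrapper that makes this robust to outliers is the non-differentiable step GDR-Net replaces.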

GDR-Net proposes an end-to-end differentiable network that directly regresses the 6D pose but guides this regression using intermediate geometric representations similar to those used in indirect methods.

Key Contributions and Methodology:

  1. Revisiting Direct Regression: The authors first analyze direct regression methods and identify key factors for improving their performance:
    • Rotation Parameterization: Using a continuous 6-dimensional representation ($R_\text{6d}$) derived from the first two columns of the rotation matrix, specifically in its allocentric form (viewpoint-invariant under translation), significantly outperforms discontinuous representations like quaternions or Lie algebra vectors.
    • Translation Parameterization: Employing a Scale-Invariant Translation Estimation (SITE) representation ($t_\text{SITE}$), which normalizes the projected 2D center offset $(\delta_x, \delta_y)$ by the bounding box dimensions and the depth estimate $\delta_z$ by the zoom-in ratio, proves more effective for handling zoomed-in Regions of Interest (RoIs) than directly regressing the 3D translation or $(o_x, o_y, t_z)$.
    • Disentangled Loss: A loss function ($L_\text{Pose}$) that separately penalizes errors in rotation ($L_R$, based on the average L1 distance between transformed model points), normalized 2D center ($L_\text{center}$), and normalized depth ($L_z$) yields better results than coupled Point-Matching losses or simple angular/L1 losses on the pose parameters. A symmetry-aware version ($L_{R,\text{sym}}$) is used for symmetric objects.
  2. Geometry-Guided Direct Regression Network (GDR-Net):
    • Architecture: Inspired by the state-of-the-art indirect method CDPN (Li et al., 2019), GDR-Net takes a zoomed-in RoI as input and predicts intermediate geometric feature maps:
      • Dense Correspondences Map ($M_\text{2D-3D}$): Encodes dense 2D-3D correspondences by predicting normalized 3D object coordinates ($M_\text{XYZ}$) for visible pixels and stacking them with their 2D pixel coordinates. This provides rich geometric shape information in an image-like structure.
      • Surface Region Attention Map ($M_\text{SRA}$): Predicts, for each pixel, a probability distribution over predefined surface regions (derived via farthest point sampling on the 3D model). This acts as an ambiguity-aware attention mechanism, especially helpful for symmetric objects, and serves as an auxiliary task that aids the learning of $M_\text{XYZ}$.
      • Visible Object Mask ($M_\text{vis}$).
    • Patch-PnP Module: Instead of using traditional PnP/RANSAC, GDR-Net introduces a simple, learnable 2D convolutional module ("Patch-PnP") that takes the $M_\text{2D-3D}$ and $M_\text{SRA}$ maps as input and directly regresses the 6D pose (parameterized as $R_\text{a6d}$ and $t_\text{SITE}$). This module exploits the image-like spatial structure of the dense correspondence maps, which PointNet-based approaches neglect. It consists of several convolutional layers followed by fully connected layers.
    • End-to-End Training: The entire network, including the backbone predicting the geometric maps and the Patch-PnP module, is trained end-to-end. The total loss is $L_\text{GDR} = L_\text{Pose} + L_\text{Geom}$, where $L_\text{Geom}$ supervises the intermediate geometric maps (L1 loss for $M_\text{XYZ}$ and $M_\text{vis}$, cross-entropy for $M_\text{SRA}$).
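The continuous 6D rotation representation above maps the network's two predicted 3D vectors back to a full rotation matrix by Gram-Schmidt orthogonalization. A minimal sketch (illustrative, not the authors' code; the allocentric-to-egocentric conversion is omitted):

```python
import numpy as np

def rot6d_to_matrix(r6: np.ndarray) -> np.ndarray:
    """Map a 6D rotation representation (the first two columns of R)
    to a 3x3 rotation matrix via Gram-Schmidt orthogonalization."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)        # normalize the first column
    b2 = a2 - np.dot(b1, a2) * b1       # remove the component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)               # third column from the cross product
    return np.stack([b1, b2, b3], axis=-1)
```

Because the map is continuous and the output is orthonormal by construction, gradients flow cleanly through it, which is what makes this parameterization friendlier to regression than quaternions or axis-angle vectors.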

Implementation Details:

  • Input: Zoomed-in RoIs (e.g., $256 \times 256$) obtained from an off-the-shelf 2D object detector (e.g., Faster R-CNN, FCOS). Dynamic Zoom-In (DZI) augmentation is used during training to decouple detection and pose estimation training.
  • Network: Based on the CDPN architecture for feature extraction, modified to output $M_\text{XYZ}$, $M_\text{SRA}$, and $M_\text{vis}$. Patch-PnP uses 3x3 convolutions with stride 2, Group Normalization, and ReLU, followed by FC layers.
  • Training: Uses the Ranger optimizer with a cosine learning rate schedule. Trained end-to-end without complex multi-stage strategies.
  • Pose Conversion: The predicted $R_\text{a6d}$ is converted to a 3x3 rotation matrix $R$ via Gram-Schmidt orthogonalization of its two 3D vectors; the predicted $t_\text{SITE}$ is converted to the 3D translation $t$ using the bounding box information and the camera intrinsics $K$.
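The SITE-to-translation conversion can be sketched as follows. This is an illustrative reconstruction from the description above, not the released code; the exact normalization constants (box width/height and zoom ratio) are assumptions:

```python
import numpy as np

def site_to_translation(delta, bbox, zoom_ratio, K):
    """Recover the 3D translation t from a SITE prediction.

    delta      : (dx, dy, dz) network output
    bbox       : (cx, cy, w, h) detected box in pixels
    zoom_ratio : resize ratio of the zoomed-in RoI, e.g. 256 / max(w, h)
    K          : 3x3 camera intrinsics
    """
    dx, dy, dz = delta
    cx, cy, w, h = bbox
    ox = dx * w + cx                  # projected object center (pixels)
    oy = dy * h + cy
    tz = dz * zoom_ratio              # undo the scale normalization of depth
    # back-project the 2D center to a viewing ray and scale by depth
    ray = np.linalg.inv(K) @ np.array([ox, oy, 1.0])
    return tz * ray
```

Because the offsets are normalized by the box size and the depth by the zoom ratio, the same network output remains valid regardless of how far the RoI was zoomed in, which is the point of the scale-invariant parameterization.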

Experiments and Results:

  • Synthetic Sphere: Patch-PnP demonstrates superior robustness to noise and outliers compared to RANSAC-based EPnP and the learning-based PnP of Hu et al. (2020).
  • Ablation Study (LM dataset): Confirms the effectiveness of the chosen pose parameterizations ($R_\text{a6d}$, $t_\text{SITE}$), the disentangled loss $L_\text{Pose}$, the Patch-PnP module (outperforming alternatives such as a PointNet-like PnP and BPnP (Chen et al., 2020)), and the geometric guidance (especially $M_\text{2D-3D}$ and $M_\text{SRA}$). Notably, GDR-Net trained only with $L_\text{Pose}$ (no explicit geometric map supervision) already performs competitively, highlighting the benefit of proper parameterization and loss design.
  • State-of-the-Art Comparison (LM, LM-O, YCB-V): GDR-Net achieves state-of-the-art results on these benchmarks, significantly outperforming previous direct and indirect methods without refinement steps. It even surpasses some refinement-based methods like DeepIM (Li et al., 2018) and is competitive with CosyPose (Labbé et al., 2020) while being faster. Results are strong both under standard protocols and under the BOP benchmark protocol (Hodaň et al., 2018).
  • Runtime: Achieves real-time performance (e.g., ~35ms for 8 objects on a 2080Ti GPU, including detection).

Practical Implications:

GDR-Net provides a practical and effective way to perform monocular 6D object pose estimation.

  • End-to-End Differentiability: Makes it suitable for integration into larger systems or for tasks like self-supervised learning where gradients through the pose estimator are needed.
  • High Accuracy: Achieves state-of-the-art accuracy without requiring computationally expensive refinement steps or complex PnP/RANSAC solvers during inference.
  • Real-time Speed: Efficient enough for real-world applications like robotics and AR.
  • Implementation: The architecture builds upon existing structures (like CDPN) and introduces a relatively simple Patch-PnP module. The key implementation choices (pose parameterization, loss function) are clearly outlined. The use of Dynamic Zoom-In simplifies training by decoupling it from specific object detectors.
  • Robustness: The learned Patch-PnP module shows better robustness to noise in the intermediate representations compared to traditional methods.

In summary, GDR-Net presents a unified framework that leverages geometric insights from indirect methods to guide a direct regression network, resulting in an accurate, fast, and end-to-end differentiable solution for 6D object pose estimation. The paper demonstrates the importance of careful parameterization and loss design in direct regression and introduces the effective Patch-PnP module for learning the pose estimation step from dense, structured geometric features.