- The paper presents a dual-decoder architecture that combines explicit and implicit predictions to refine 6D object pose and size estimates.
- It leverages spherical convolutions and a self-adaptive loss to enforce pose consistency from single-view RGB-D inputs.
- Extensive experiments on datasets like REAL275 demonstrate superior performance with significant mAP improvements over existing methods.

DualPoseNet: Category-level 6D Object Pose and Size Estimation
The paper presents DualPoseNet, an approach to category-level 6D object pose and size estimation built on a dual pose network with refined learning of pose consistency. The problem it addresses is estimating the full pose configuration (rotation, translation, and size) of object instances observed from a single arbitrary view in a cluttered scene, a capability needed in augmented reality, robotics, and autonomous vehicles.
Methodology
Dual Pose Network Architecture
DualPoseNet pairs a shared pose encoder with two pose decoders. The encoder is built on spherical convolutions to learn pose-sensitive features, and the two decoders are:
- Explicit Decoder: An MLP that directly predicts the rotation, translation, and size of the object.
- Implicit Decoder: Reconstructs the input point cloud in its canonical pose. A self-adaptive loss term enforces consistency between this reconstruction and the explicit decoder's prediction. Because that loss requires no CAD model, it can also be applied at test time to refine the predicted pose.
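The consistency idea between the two decoders can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the canonical frame is obtained as q = R^T (p - t) / ||s||, and the function names `to_canonical` and `consistency_loss` are hypothetical. The loss is zero when the implicit decoder's canonical points agree exactly with the input canonicalized by the explicit pose.

```python
import math


def transpose(m):
    """Transpose a 3x3 matrix given as nested lists."""
    return [list(row) for row in zip(*m)]


def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def to_canonical(points, rotation, translation, size):
    """Map observed points to the canonical frame.

    Assumed convention (not confirmed by the summary):
    q = R^T (p - t) / ||s||.
    """
    scale = math.sqrt(sum(c * c for c in size))
    r_t = transpose(rotation)
    canonical = []
    for p in points:
        centered = [pi - ti for pi, ti in zip(p, translation)]
        canonical.append([c / scale for c in matvec(r_t, centered)])
    return canonical


def consistency_loss(observed, implicit_canonical, rotation, translation, size):
    """Mean distance between the implicit decoder's canonical points and
    the observed points canonicalized by the explicit pose (R, t, s)."""
    explicit_canonical = to_canonical(observed, rotation, translation, size)
    total = 0.0
    for q_imp, q_exp in zip(implicit_canonical, explicit_canonical):
        total += math.dist(q_imp, q_exp)
    return total / len(observed)
```

Since the loss depends only on the observed points and the two predictions, nothing in it requires a ground-truth pose or a CAD model, which is what would allow it to be minimized at test time.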
Spherical Convolutions and Fusion
Spherical convolutions are rotation-equivariant, which lets the encoder capture pose-sensitive shape features from the RGB-D data. A Spherical Fusion module embedded in the encoder integrates the features derived from appearance and shape observations to improve the encoder's feature learning.
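Spherical convolutions operate on signals defined over a sphere, so the point cloud has to be projected onto a spherical grid first. The sketch below shows one common way to do this, under assumptions the summary does not confirm: points are binned by direction on an equiangular grid, and each cell stores the largest distance from the origin among its points (empty cells stay at zero). The function name `spherical_signal` and the grid sizes are hypothetical.

```python
import math


def spherical_signal(points, n_elev=4, n_azim=8):
    """Project 3D points onto an n_elev x n_azim spherical grid.

    Each cell keeps the largest distance from the origin among the
    points falling in that direction; empty cells hold 0.0.
    """
    grid = [[0.0] * n_azim for _ in range(n_elev)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the origin has no direction
        # Polar angle in [0, pi], measured from the +z axis.
        elev = math.acos(max(-1.0, min(1.0, z / r)))
        # Azimuth wrapped into [0, 2*pi).
        azim = math.atan2(y, x) % (2.0 * math.pi)
        i = min(int(elev / math.pi * n_elev), n_elev - 1)
        j = min(int(azim / (2.0 * math.pi) * n_azim), n_azim - 1)
        grid[i][j] = max(grid[i][j], r)
    return grid
```

An appearance signal could be built the same way by storing the color of the selected point instead of its distance, giving two aligned spherical maps for a fusion module to combine.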
Results
Experiments cover both category-level and instance-level object pose datasets: CAMERA25 and REAL275 for category-level estimation, and YCB-Video and LineMOD for instance-level estimation. DualPoseNet outperforms existing methods, particularly on high-precision metrics such as IoU50 and IoU75.
Numerical Highlights:
- On REAL275, DualPoseNet achieved an mAP of 44.5% under the joint criterion of IoU50, a 10° error threshold, and a 10% scale error threshold, outperforming the existing methods it was compared against.
- Improvements also appear on the synthetic benchmark datasets; for example, DualPoseNet's mAP peaks at 86.4% under IoU75 and translation/rotation thresholds, ahead of prior methods.
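Threshold-based pose metrics of this kind count a prediction as correct only when its rotation and translation errors both fall under the stated limits. The sketch below shows such a per-instance check, assuming the rotation error is the geodesic angle between rotation matrices; the function names are hypothetical, and the aggregation into mAP and any special handling of symmetric objects are omitted.

```python
import math


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and true translations."""
    return math.dist(t_pred, t_gt)


def pose_within(r_pred, t_pred, r_gt, t_gt, max_deg, max_dist):
    """True when both errors are inside their thresholds.

    max_dist uses the same unit as the translations.
    """
    return (rotation_error_deg(r_pred, r_gt) <= max_deg
            and translation_error(t_pred, t_gt) <= max_dist)
```

For example, a prediction that is 8° and 3 cm away from the ground truth passes a 10°, 5 cm check but fails a 5°, 5 cm check.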
Practical and Theoretical Implications
The dual pose estimation mechanism is most relevant in scenarios where no CAD models are available for refinement as a post-processing step. Using a self-adaptive loss to enforce consistency between the two decoders offers a way to refine pose predictions without such models.
For real-world use, inferring precise 6D poses without relying on CAD models matters for scalability and deployment, particularly in domains that need fast, adaptable object detection.
Future Developments
The work leaves room for further research, such as exploring alternatives to spherical convolutions and applying the architecture in more diverse and complex environments. Another avenue is extending the method to monocular inputs alone, which could widen its range of real-world applications.
Overall, DualPoseNet is a step forward in category-level 6D object pose estimation, offering practical benefits and a foundation for further research.