DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion (1901.04780v1)

Published 15 Jan 2019 in cs.CV and cs.RO

Abstract: A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches in two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose.

Summary

The paper introduces DenseFusion, an end-to-end framework fusing RGB and depth features at a pixel level to enhance 6D object pose estimation accuracy and robustness.
It integrates an iterative pose refinement mechanism into the network, replacing costly post-processing steps such as ICP and enabling near real-time inference in cluttered, occluded scenes.
Empirical results on the YCB-Video and LineMOD benchmarks show state-of-the-art accuracy; on YCB-Video, pose accuracy is 3.5% higher than PoseCNN+ICP with inference about 200 times faster.

DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

The paper "DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion" authored by Wang et al. introduces a novel approach to the long-standing problem of 6D object pose estimation using RGB-D images. This research is motivated by the necessity to seamlessly integrate color (RGB) and depth (D) information to achieve accurate, robust, and real-time performance in pose estimation tasks, which are critical in applications such as robotic grasping, autonomous navigation, and augmented reality.

Key Contributions

The paper presents several key contributions:

DenseFusion Framework: Unlike prior methods that either process RGB and depth data separately or rely extensively on post-processing steps, DenseFusion proposes a robust end-to-end architecture. This architecture individually processes RGB and depth data to extract dense feature embeddings and subsequently fuses them at a pixel-wise level. The fusion at this granular level allows for the effective handling of occlusions.
Iterative Pose Refinement: DenseFusion integrates an iterative pose refinement mechanism within the network. This distinguishes it from methods that require separate, often costly post-processing steps such as ICP. The refinement is achieved end-to-end, enhancing both performance and speed, bringing it closer to real-time application.
Performance on Benchmark Datasets: The framework's efficacy is demonstrated on two challenging benchmarks, the YCB-Video and LineMOD datasets. On YCB-Video, DenseFusion improves pose accuracy by 3.5% over PoseCNN+ICP while running about 200 times faster.

Detailed Analysis

Dense Fusion Architecture

DenseFusion operates in two main stages:

Feature Extraction: The first stage processes the RGB image through a fully convolutional network to produce dense color features. Simultaneously, the depth data, converted to a 3D point cloud, is processed using a PointNet-based network to extract geometric features. This respects the inherent structures of both data types and avoids the limitations of treating depth data merely as a supplementary image channel.
Pixel-wise Fusion Network: The RGB and geometric features are fused at a pixel level. This local fusion allows predictions based on explicit reasoning about the local appearance and geometry. Moreover, a global fusion step enriches the local features with broader context.
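
The two stages above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `backproject` and `dense_fuse` are hypothetical, the camera is assumed to follow a pinhole model with intrinsics `fx, fy, cx, cy`, and plain average pooling stands in for the learned global feature.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Convert a depth map (H, W) into per-pixel 3D points (H, W, 3).

    Assumes a pinhole camera; fx, fy, cx, cy are the intrinsics.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column, v: row
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def dense_fuse(color_feat, geo_feat):
    """Fuse features for the same N pixels.

    color_feat: (N, Dc) per-pixel color embeddings (stand-in for CNN output).
    geo_feat:   (N, Dg) per-point geometry embeddings (stand-in for a
                PointNet-style network).
    Returns an (N, 2 * (Dc + Dg)) array: each row holds the local fused
    feature followed by a global feature shared by all pixels.
    """
    local = np.concatenate([color_feat, geo_feat], axis=1)
    # Hypothetical simplification: average pooling as the global context.
    global_feat = local.mean(axis=0, keepdims=True)
    tiled = np.repeat(global_feat, local.shape[0], axis=0)
    return np.concatenate([local, tiled], axis=1)

points = backproject(np.full((2, 3), 2.0), fx=1.0, fy=1.0, cx=1.0, cy=0.5)
fused = dense_fuse(np.ones((4, 3)), np.zeros((4, 2)))
print(points.shape, fused.shape)
```

Keeping one fused feature per pixel, rather than a single feature for the whole object crop, is what lets each visible pixel make its own prediction when other parts of the object are occluded.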

Pose Estimation and Refinement

DenseFusion's pose predictor generates a pose estimate for each pixel. The network also outputs a confidence score for each estimate, learned without explicit confidence labels (self-supervised), and the estimate with the highest confidence is selected as the final pose.
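
A rough sketch of how per-pixel confidences might be used follows. The names `select_pose` and `confidence_weighted_loss` are hypothetical, and the loss is one plausible form of confidence weighting (per-pixel loss scaled by confidence, plus a log term that penalizes low confidence), with the weight `w` an assumed hyperparameter rather than a value taken from the source.

```python
import numpy as np

def select_pose(rotations, translations, confidences):
    """Pick the per-pixel hypothesis with the highest confidence.

    rotations: (N, 3, 3), translations: (N, 3), confidences: (N,).
    Returns the chosen rotation, translation, and its index.
    """
    best = int(np.argmax(confidences))
    return rotations[best], translations[best], best

def confidence_weighted_loss(per_pixel_loss, confidences, w=0.01):
    """Average of loss_i * c_i - w * log(c_i) over all pixels.

    The first term lets the network down-weight pixels it finds hard;
    the log term stops it from driving every confidence to zero.
    No confidence labels are needed. w is an assumed value.
    """
    loss = np.asarray(per_pixel_loss, dtype=float)
    conf = np.asarray(confidences, dtype=float)
    return float(np.mean(loss * conf - w * np.log(conf)))

rots = np.stack([np.eye(3)] * 3)
trans = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.0, 0.2, 1.0]])
R, t, idx = select_pose(rots, trans, np.array([0.2, 0.9, 0.5]))
print(idx, t)
```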

The iterative refinement mechanism then polishes this estimate. Treating the current pose estimate as an intermediate canonical frame, the network transforms the input point cloud into that frame, predicts a residual pose from the transformed data, and repeats, so that each iteration reduces the remaining error.
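
The refinement loop can be sketched as follows, assuming the pose convention p_cam = R p_model + t. The helper names are hypothetical, and `residual_fn` stands in for the learned residual estimator; the composition order shown follows from this convention and may differ from the notation used in the paper.

```python
import numpy as np

def to_canonical(points, R, t):
    """Map camera-frame points (N, 3) into the frame of pose (R, t).

    Row-vector form of R.T @ (p - t).
    """
    return (points - t) @ R

def compose(R_cur, t_cur, R_res, t_res):
    """Fold a residual pose into the current one.

    The result satisfies p_cam = R_cur @ (R_res @ p + t_res) + t_cur.
    """
    return R_cur @ R_res, R_cur @ t_res + t_cur

def refine(points, R0, t0, residual_fn, n_iters=2):
    """Iteratively refine an initial pose (R0, t0).

    residual_fn is a placeholder for a trained network: it receives the
    points expressed in the current estimated frame and returns a
    residual rotation and translation.
    """
    R, t = R0, t0
    for _ in range(n_iters):
        canon = to_canonical(points, R, t)
        R_res, t_res = residual_fn(canon)
        R, t = compose(R, t, R_res, t_res)
    return R, t

# Toy check with a translation-only error and a centroid-based "estimator".
model = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
true_t = np.array([0.1, 0.2, 0.3])
observed = model + true_t
centroid_fn = lambda c: (np.eye(3), c.mean(axis=0) - model.mean(axis=0))
R_hat, t_hat = refine(observed, np.eye(3), np.zeros(3), centroid_fn)
print(t_hat)
```

Because each pass re-expresses the input in the latest estimated frame, the estimator only ever has to predict a small correction, which is the intuition behind refining inside the network instead of running a separate step such as ICP.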

Empirical Validation

The experiments validate the effectiveness of DenseFusion. On the YCB-Video dataset, DenseFusion outperforms both PointFusion and PoseCNN+ICP, with the largest gains in scenes with heavy occlusion. Its accuracy degrades only slightly as occlusion increases, whereas the earlier methods show substantial drops.

On the LineMOD dataset, DenseFusion again leads in mean accuracy, reaching 86.2% under the ADD (Average Distance of Model Points) metric, which surpasses the state-of-the-art methods even without iterative refinement. With iterative refinement, accuracy improves further, confirming the benefit of refining within the network.
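
For readers unfamiliar with the ADD metric, a minimal sketch is given below. It assumes the usual definition (mean distance between model points under the ground-truth and predicted poses) and the commonly used success threshold of 10% of the object diameter; the threshold is an assumption here, not something stated in this summary, and the function names are hypothetical.

```python
import numpy as np

def add_metric(model_points, R_gt, t_gt, R_pred, t_pred):
    """Average distance between corresponding model points (N, 3)
    transformed by the ground-truth pose and by the predicted pose."""
    gt = model_points @ R_gt.T + t_gt
    pred = model_points @ R_pred.T + t_pred
    return float(np.linalg.norm(gt - pred, axis=1).mean())

def add_accuracy(add_values, diameters, threshold=0.1):
    """Fraction of predictions whose ADD is below threshold * diameter.

    threshold=0.1 is the assumed, commonly used setting.
    """
    add_values = np.asarray(add_values, dtype=float)
    diameters = np.asarray(diameters, dtype=float)
    return float(np.mean(add_values < threshold * diameters))

model = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
err = add_metric(model, np.eye(3), np.zeros(3),
                 np.eye(3), np.array([0.03, 0.0, 0.0]))
print(err, add_accuracy([0.005, 0.02], [0.1, 0.1]))
```

A reported figure such as 86.2% is then the share of test frames whose ADD falls under the threshold, not an average distance.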

Practical Implications and Future Directions

DenseFusion's near real-time inference makes it a practical candidate for robotics, where quick and accurate pose estimation is crucial. In particular, the authors deploy the method on a real robot, which grasps and manipulates objects based on the estimated poses.

The research suggests potential future directions such as integrating more sophisticated geometric reasoning and leveraging additional data modalities for even richer feature representations. Moreover, exploration into more tightly coupled segmentation and pose estimation could yield further performance gains, ensuring greater robustness and generalizability.

In conclusion, DenseFusion advances 6D object pose estimation by fusing RGB and depth information at the pixel level and refining the pose within the network. Its speed and its robustness under occlusion make it a credible candidate for deployment in real-world robotic and augmented reality systems.
