MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion (2004.04336v1)

Published 9 Apr 2020 in cs.CV and cs.RO

Abstract: Robots and other smart devices need efficient object-based scene representations from their on-board vision systems to reason about contact, physics and occlusion. Recognized precise object models will play an important role alongside non-parametric reconstructions of unrecognized structures. We present a system which can estimate the accurate poses of multiple known objects in contact and occlusion from real-time, embodied multi-view vision. Our approach makes 3D object pose proposals from single RGB-D views, accumulates pose estimates and non-parametric occupancy information from multiple views as the camera moves, and performs joint optimization to estimate consistent, non-intersecting poses for multiple objects in contact. We verify the accuracy and robustness of our approach experimentally on 2 object datasets: YCB-Video, and our own challenging Cluttered YCB-Video. We demonstrate a real-time robotics application where a robot arm precisely and orderly disassembles complicated piles of objects, using only on-board RGB-D vision.

Citations (90)

Summary

The paper introduces a novel volumetric fusion framework that integrates multi-view RGB-D data to address occlusion and contact-rich object scenarios.
It employs deep neural networks for initial pose prediction and differentiable collision detection to refine 6D pose estimates with physical plausibility.
Empirical results on YCB-Video datasets demonstrate superior ADD(-S) and ADD-S performance compared to DenseFusion baselines, advancing robotic manipulation tasks.

Multi-object Reasoning for 6D Pose Estimation through Volumetric Fusion

The paper "MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion" by Wada et al. addresses 6D pose estimation for objects that occlude and touch one another in cluttered scenes. It develops a framework that estimates precise poses of known objects in real time from embodied multi-view RGB-D vision. Its use of volumetric fusion sets it apart from traditional techniques, which often falter under heavy occlusion and in contact-rich object arrangements.

System Design and Methodology

The MoreFusion system is distinguished by its integration of volumetric mapping within its pose estimation pipeline. The system architecture comprises four primary components:

Object-level Volumetric Fusion: The system leverages octree-based mapping methods to develop a volumetric occupancy map drawn from RGB-D inputs. This foundational step allows for capturing the occupied and free spaces, as well as modeling occlusions effectively.
Volumetric Pose Prediction: Emphasizing spatial awareness, the pose prediction module applies a deep neural network to process volumetric data in conjunction with masked RGB-D images. This dual input enables the system to generate initial pose predictions by considering the surrounding spatial configuration.
Collision-based Pose Refinement: Differentiable collision detection is used to optimize the initial pose estimates. The refinement uses the surrounding geometry to assess each pose hypothesis and drives the estimates toward non-intersecting, physically plausible configurations.
CAD Model Alignment: Once consistent pose hypotheses are validated through multiple observations, detail-rich CAD models are integrated into the scene reconstruction, replacing the initial volumetric representations for enhanced precision.
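
The idea behind collision-based refinement can be illustrated with a deliberately simplified sketch. It is not the paper's implementation: MoreFusion reasons over volumetric occupancy and full 6D poses, whereas this example swaps in two spheres and refines only a translation. The function names `penetration` and `refine_translation`, the squared-depth loss, and the step size are all assumptions made for illustration.

```python
import math

def penetration(c_a, r_a, c_b, r_b):
    """Overlap depth of two spheres; 0.0 when they do not intersect."""
    return max(0.0, r_a + r_b - math.dist(c_a, c_b))

def refine_translation(c_a, r_a, c_b, r_b, step=0.5, iters=50):
    """Gradient descent on loss = 0.5 * depth**2 with respect to c_a.

    Toy stand-in for collision-based refinement: sphere A is moved,
    sphere B (the "surrounding geometry") stays fixed.
    """
    c = list(c_a)
    for _ in range(iters):
        d = math.dist(c, c_b)
        depth = r_a + r_b - d
        if depth <= 0.0 or d == 0.0:
            # No overlap left, or direction undefined for coincident centres.
            break
        # depth = (r_a + r_b) - d, so d(loss)/d(c) = -depth * (c - c_b) / d.
        grad = [-depth * (c[i] - c_b[i]) / d for i in range(3)]
        c = [c[i] - step * grad[i] for i in range(3)]
    return tuple(c)

if __name__ == "__main__":
    start = (0.5, 0.0, 0.0)
    refined = refine_translation(start, 1.0, (0.0, 0.0, 0.0), 1.0)
    print("overlap before:", penetration(start, 1.0, (0.0, 0.0, 0.0), 1.0))
    print("overlap after: ", penetration(refined, 1.0, (0.0, 0.0, 0.0), 1.0))
```

Because the penalty is differentiable wherever the spheres overlap, the same gradient could be combined with other pose losses in a joint optimization, which is the property the refinement stage relies on.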

Empirical Validation

MoreFusion is evaluated on the YCB-Video dataset and on the authors' own, more challenging Cluttered YCB-Video dataset. The approach performs particularly well in heavily cluttered scenes, where traditional point-cloud methods underperform. Comparisons against DenseFusion baselines on the ADD(-S) and ADD-S metrics indicate that MoreFusion is robust for both clearly visible and heavily occluded objects.
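
For readers unfamiliar with these metrics, the following is a minimal sketch of how they are commonly defined in the pose-estimation literature: ADD averages the distance between corresponding model points under the ground-truth and estimated poses, while ADD-S averages the distance from each ground-truth point to its closest estimated point, which makes it tolerant of object symmetries. ADD(-S) conventionally means ADD for non-symmetric objects and ADD-S for symmetric ones. The helper names and the nested-list rotation format are assumptions for illustration, not the paper's evaluation code.

```python
import math

def transform(points, R, t):
    """Apply a 3x3 rotation R (nested lists) and translation t to 3D points."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding transformed model points."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    return sum(math.dist(a, b) for a, b in zip(gt, est)) / len(points)

def add_s_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each ground-truth point to the nearest estimated point."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    return sum(min(math.dist(a, b) for b in est) for a in gt) / len(points)

# A four-point model with 90-degree rotational symmetry about the z axis.
MODEL = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ROT_Z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
ZERO = (0, 0, 0)

if __name__ == "__main__":
    print("ADD:  ", add_metric(MODEL, IDENTITY, ZERO, ROT_Z_90, ZERO))
    print("ADD-S:", add_s_metric(MODEL, IDENTITY, ZERO, ROT_Z_90, ZERO))
```

In this toy case the estimate is off by a rotation that maps the model onto itself, so ADD reports a large error while ADD-S reports none, which is why symmetric objects are scored with ADD-S.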

Practical Implications and Future Directions

From a practical standpoint, MoreFusion supports complex robotic manipulation: the authors demonstrate a real-time application in which a robot arm disassembles complicated piles of objects precisely and in order, using only on-board RGB-D vision. The system's handling of object symmetries and complex occlusions suggests broad applicability to tasks such as sorting, pick-and-place, assembly, and disassembly.

The paper also points to avenues for future research, suggesting that incorporating physical dynamics into the framework would extend it to scenarios that require further reasoning over object interactions. This could involve integrating physical simulation or real-time force feedback to augment visual pose estimation with mechanical understanding.

Conclusion

MoreFusion advances 6D object pose estimation by combining multi-view volumetric fusion with object-level reasoning about occlusion and contact. By handling intricate occlusions while maintaining real-time processing, it takes a significant step toward precise robotic manipulation in cluttered environments. Future work that incorporates physical simulation could extend its applicability and improve its accuracy across a broader range of tasks.
