- The paper introduces SemanticFusion, a system that combines ElasticFusion with CNNs to enhance 3D semantic mapping.
- It fuses per-frame CNN predictions into per-surfel class probabilities via Bayesian updates, with an optional CRF to further regularize the labels across the map.
- Real-time experiments on an office-reconstruction sequence and the NYUv2 dataset show consistent accuracy gains over single-frame segmentation, demonstrating practical utility in robotics.
SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks
Introduction
This paper investigates the integration of semantic understanding into dense 3D mapping, a capability pivotal for more intelligent mobile robots. The authors propose a system named SemanticFusion, which combines Convolutional Neural Networks (CNNs) with ElasticFusion, a dense SLAM system. This combination enables the creation of semantically annotated 3D maps by probabilistically merging semantic predictions from multiple viewpoints, thereby overcoming a key limitation of single-frame segmentation methods.
Methodology
SemanticFusion is structured around three core modules:
- SLAM System (ElasticFusion):
- Provides dense correspondence across frames for indoor RGB-D sequences.
- Maintains globally consistent maps, accommodating loop closures with a deformation graph and surfel-based representation.
- Convolutional Neural Network (CNN):
- Utilizes an adapted version of Noh et al.'s deconvolutional semantic segmentation architecture.
- Processes RGB-D inputs for semantic segmentation, exploiting learned features to improve per-pixel class predictions.
- Incremental Semantic Label Fusion:
- Uses Bayesian updates to refine per-surfel class probabilities by fusing predictions from multiple frames (see the sketch after this list).
- Includes an optional Conditional Random Field (CRF) for spatial regularization, enhancing semantic coherence based on map geometry.
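The label fusion can be pictured as a per-surfel naive-Bayes product of the CNN's per-frame outputs: each surfel keeps a class distribution, and each new frame's prediction multiplies into it before renormalization. Below is a minimal NumPy sketch of that update; the array layout, the `surfel_ids`/`pixel_uv` correspondence format (which the SLAM system's surfel association would supply), and the uniform prior are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

NUM_CLASSES = 14  # hypothetical count, e.g. 13 semantic classes plus "unknown"

def init_surfel_probs(num_surfels, num_classes=NUM_CLASSES):
    # New surfels start with a uniform distribution over classes.
    return np.full((num_surfels, num_classes), 1.0 / num_classes)

def fuse_frame(surfel_probs, cnn_probs, surfel_ids, pixel_uv):
    """Fold one frame's CNN output into the per-surfel distributions.

    surfel_probs : (S, C) current class distribution of every map surfel
    cnn_probs    : (H, W, C) per-pixel softmax output for the new frame
    surfel_ids   : (N,) map indices of the surfels visible in this frame
    pixel_uv     : (N, 2) integer (u, v) pixel each visible surfel projects to
    """
    u, v = pixel_uv[:, 0], pixel_uv[:, 1]
    likelihood = cnn_probs[v, u, :]                    # (N, C) per-pixel class scores
    posterior = surfel_probs[surfel_ids] * likelihood  # elementwise Bayes product
    posterior /= posterior.sum(axis=1, keepdims=True)  # renormalize each row
    surfel_probs[surfel_ids] = posterior
    return surfel_probs
```

Calling `fuse_frame` once per incoming frame gives the recursive update: surfels whose predictions repeatedly agree become confident, while single-frame errors are progressively washed out.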
Experimental Evaluations
The approach was validated on two datasets: the NYUv2 dataset and a custom office reconstruction dataset. SemanticFusion consistently outperformed the corresponding single-frame CNN baselines. Notably, the largest accuracy gains were observed on the office dataset, whose trajectories view the scene from widely varied viewpoints, underscoring the advantage of coupling SLAM correspondence with semantic prediction.
On the NYUv2 dataset, SemanticFusion also improved class average accuracy, demonstrating efficacy even for less varied scanning trajectories. Applying a CRF yielded a further slight gain in prediction accuracy, though its effect was smaller than that of the multi-view Bayesian fusion itself (a toy sketch of this smoothing step follows).
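To make the CRF step concrete, the sketch below runs one simplified mean-field-style smoothing pass over the surfel map, rewarding label agreement between spatially close surfels via a Gaussian kernel on 3-D distance. It is a toy approximation under assumed inputs (`positions` and a precomputed k-nearest-neighbor index `neighbors`), not the paper's exact CRF formulation.

```python
import numpy as np

def crf_smooth(surfel_probs, positions, neighbors, theta=0.05, pairwise_weight=1.0):
    """One simplified mean-field-style update over the surfel map.

    surfel_probs    : (S, C) per-surfel class distributions (the unary term)
    positions       : (S, 3) surfel positions in the map frame
    neighbors       : (S, K) indices of each surfel's K nearest neighbors
    theta           : bandwidth (meters) of the Gaussian spatial kernel
    pairwise_weight : strength of the smoothing term (keep modest)
    """
    diff = positions[neighbors] - positions[:, None, :]             # (S, K, 3)
    kernel = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * theta**2))  # (S, K)
    # Message passing: kernel-weighted sum of neighbor distributions per class.
    msg = np.einsum('sk,skc->sc', kernel, surfel_probs[neighbors])
    # Reward classes favored by nearby surfels, then renormalize.
    smoothed = surfel_probs * np.exp(pairwise_weight * msg)
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```

A few such passes propagate confident labels into noisy regions, which matches the paper's observation that the CRF contributes a small refinement on top of the Bayesian fusion.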
Results
- Quantitative Performance:
- On the office dataset, the SemanticFusion system improved class average accuracy from 43.6% to 48.3% for the RGB-D CNN and from 57.1% to 60.0% for Eigen's CNN.
- On the NYUv2 dataset, improvements were also seen: class average accuracy rose from 58.9% to 63.2% for Eigen's CNN when enhanced by SemanticFusion.
- Real-Time Capability:
- Achieved interactive frame rates of approximately 25.3 Hz, demonstrating practicality for real-time systems.
Implications and Future Directions
SemanticFusion demonstrates a practical integration of SLAM and semantic segmentation, enabling the generation of semantically informative 3D maps. Such maps hold potential utility in applications including autonomous navigation, object recognition, and human-robot interaction.
Future work could explore more sophisticated regularization methods, integration of object recognition to replace surfels with explicit 3D object models, and CNN compression techniques to improve real-time performance on resource-constrained hardware. Further, datasets with more diverse trajectories could reveal additional gains achievable through this fusion strategy.
This demonstration of SemanticFusion not only advances the domain of dense semantic mapping but also opens avenues toward more intelligent robotic systems capable of contextual understanding and interaction.