- The paper introduces SemanticFusion, a system that combines ElasticFusion with CNNs to enhance 3D semantic mapping.
- It fuses per-frame CNN predictions into per-surfel class probabilities via Bayesian updates, with an optional CRF to further regularize the labels across the map.
- Real-time experiments on an office-reconstruction sequence and the NYUv2 dataset show consistent accuracy gains over single-frame segmentation, demonstrating practical utility in robotics.
SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks
Introduction
This paper investigates the integration of semantic understanding into dense 3D mapping, a capability pivotal for more intelligent mobile robots. The authors propose a system named SemanticFusion, which combines Convolutional Neural Networks (CNNs) with ElasticFusion, a dense SLAM system. This combination enables the creation of semantically annotated 3D maps by probabilistically merging semantic predictions from multiple viewpoints, thereby overcoming a key limitation of single-frame segmentation methods.
Methodology
SemanticFusion is structured around three core modules:
- SLAM System (ElasticFusion):
- Provides dense correspondence across frames for indoor RGB-D sequences.
- Maintains globally consistent maps, accommodating loop closures with a deformation graph and surfel-based representation.
- Convolutional Neural Network (CNN):
- Utilizes an adapted version of Noh et al.'s deconvolutional semantic segmentation architecture.
- Processes RGB-D inputs for semantic segmentation, exploiting learned features to improve per-pixel class predictions.
- Incremental Semantic Label Fusion:
- Uses Bayesian updates to refine per-surfel class probabilities by fusing predictions from multiple frames (see the sketch after this list).
- Includes an optional Conditional Random Field (CRF) for spatial regularization, enhancing semantic coherence based on map geometry.
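The label fusion can be pictured as a per-surfel naive-Bayes product of the CNN's per-frame outputs: each surfel keeps a class distribution, and each new frame's prediction multiplies into it before renormalization. Below is a minimal NumPy sketch of that update; the array layout, the `surfel_ids`/`pixel_uv` correspondence format (which the SLAM system's surfel association would supply), and the uniform prior are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

NUM_CLASSES = 14  # hypothetical count, e.g. 13 semantic classes plus "unknown"

def init_surfel_probs(num_surfels, num_classes=NUM_CLASSES):
    # New surfels start with a uniform distribution over classes.
    return np.full((num_surfels, num_classes), 1.0 / num_classes)

def fuse_frame(surfel_probs, cnn_probs, surfel_ids, pixel_uv):
    """Fold one frame's CNN output into the per-surfel distributions.

    surfel_probs : (S, C) current class distribution of every map surfel
    cnn_probs    : (H, W, C) per-pixel softmax output for the new frame
    surfel_ids   : (N,) map indices of the surfels visible in this frame
    pixel_uv     : (N, 2) integer (u, v) pixel each visible surfel projects to
    """
    u, v = pixel_uv[:, 0], pixel_uv[:, 1]
    likelihood = cnn_probs[v, u, :]                    # (N, C) per-pixel class scores
    posterior = surfel_probs[surfel_ids] * likelihood  # elementwise Bayes product
    posterior /= posterior.sum(axis=1, keepdims=True)  # renormalize each row
    surfel_probs[surfel_ids] = posterior
    return surfel_probs
```

Calling `fuse_frame` once per incoming frame gives the recursive update: surfels whose predictions repeatedly agree become confident, while single-frame errors are progressively washed out.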
Experimental Evaluations
The approach was validated on two datasets: the NYUv2 dataset and a custom office reconstruction dataset. SemanticFusion consistently outperformed the corresponding single-frame CNN baselines. Notably, the largest accuracy gains were observed on the office dataset, whose trajectories view the scene from widely varied viewpoints, underscoring the advantage of coupling SLAM correspondence with semantic prediction.
On the NYUv2 dataset, SemanticFusion also improved class average accuracy, demonstrating efficacy even for less varied scanning trajectories. Applying a CRF yielded a further slight gain in prediction accuracy, though its effect was smaller than that of the multi-view Bayesian fusion itself (a toy sketch of this smoothing step follows).
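To make the CRF step concrete, the sketch below runs one simplified mean-field-style smoothing pass over the surfel map, rewarding label agreement between spatially close surfels via a Gaussian kernel on 3-D distance. It is a toy approximation under assumed inputs (`positions` and a precomputed k-nearest-neighbor index `neighbors`), not the paper's exact CRF formulation.

```python
import numpy as np

def crf_smooth(surfel_probs, positions, neighbors, theta=0.05, pairwise_weight=1.0):
    """One simplified mean-field-style update over the surfel map.

    surfel_probs    : (S, C) per-surfel class distributions (the unary term)
    positions       : (S, 3) surfel positions in the map frame
    neighbors       : (S, K) indices of each surfel's K nearest neighbors
    theta           : bandwidth (meters) of the Gaussian spatial kernel
    pairwise_weight : strength of the smoothing term (keep modest)
    """
    diff = positions[neighbors] - positions[:, None, :]             # (S, K, 3)
    kernel = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * theta**2))  # (S, K)
    # Message passing: kernel-weighted sum of neighbor distributions per class.
    msg = np.einsum('sk,skc->sc', kernel, surfel_probs[neighbors])
    # Reward classes favored by nearby surfels, then renormalize.
    smoothed = surfel_probs * np.exp(pairwise_weight * msg)
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```

A few such passes propagate confident labels into noisy regions, which matches the paper's observation that the CRF contributes a small refinement on top of the Bayesian fusion.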
Results
- Quantitative Performance:
- On the office dataset, the SemanticFusion system improved class average accuracy from 43.6% to 48.3% for the RGB-D CNN and from 57.1% to 60.0% for Eigen's CNN.
- On the NYUv2 dataset, improvements were also seen: class average accuracy rose from 58.9% to 63.2% for Eigen's CNN when enhanced by SemanticFusion.
- Real-Time Capability:
- Achieved interactive frame rates of approximately 25.3 Hz, demonstrating practicality for real-time systems.
Implications and Future Directions
SemanticFusion demonstrates a practical integration of SLAM and semantic segmentation, enabling the generation of semantically informative 3D maps. Such maps hold potential utility in applications including autonomous navigation, object recognition, and human-robot interaction.
Future work could explore more sophisticated regularization methods, integration of object recognition to replace surfels with explicit 3D object models, and CNN compression techniques to improve real-time performance on resource-constrained hardware. Further, datasets with more diverse trajectories could reveal additional gains achievable through this fusion strategy.
This demonstration of SemanticFusion not only advances the domain of dense semantic mapping but also opens avenues toward more intelligent robotic systems capable of contextual understanding and interaction.