- The paper introduces a novel encoder-decoder CNN integrating multi-branch attentive fusion and adaptive feature selection modules.
- The methodology combines voxel- and point-based learning to achieve a 69.7% mIoU on the SemanticKITTI benchmark, outperforming previous approaches.
- The approach significantly improves segmentation in sparse LiDAR data, paving the way for safer and more reliable autonomous navigation systems.
Attentive Feature Fusion for Sparse Semantic Segmentation Networks
The paper presents a novel approach to semantic segmentation of LiDAR point clouds for autonomous systems, with a specific focus on the computational inefficiency and data sparsity that hamper traditional methods. The proposed method, referred to as \method, combines voxel-based and point-based learning within a unified framework. It introduces a multi-branch attentive feature fusion module in its encoder and an adaptive feature selection module in its decoder, which together let it handle the large-scale 3D scenes typical of LiDAR tasks.
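To make the voxel-based side concrete, the sketch below turns a raw LiDAR scan into a sparse voxel tensor and applies one sparse convolution using the Minkowski Engine, the framework the paper builds on. The point count, the 5 cm voxel size, and the use of intensity as the input feature are illustrative assumptions rather than settings taken from the paper.

```python
import torch
import MinkowskiEngine as ME

# A stand-in LiDAR scan: N points with metric (x, y, z) coordinates and a
# per-point intensity feature. Real scans would be loaded from disk.
points = torch.rand(100_000, 3) * 100.0   # assumed 100 m scene extent
intensity = torch.rand(100_000, 1)

# Quantize continuous coordinates onto a 5 cm voxel grid (assumed size);
# points falling into the same voxel are merged.
coords, feats = ME.utils.sparse_quantize(
    coordinates=points, features=intensity, quantization_size=0.05
)

# Prepend a batch index (a single scan here) and build the sparse tensor
# that Minkowski layers consume.
coords = ME.utils.batched_coordinates([coords])
x = ME.SparseTensor(features=feats, coordinates=coords)

# A 3D sparse convolution: computation happens only at occupied voxels,
# which is what keeps memory use manageable at this scale.
conv = ME.MinkowskiConvolution(
    in_channels=1, out_channels=32, kernel_size=3, stride=1, dimension=3
)
y = conv(x)
```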
Methodology Overview
\method~is designed as an encoder-decoder Convolutional Neural Network (CNN) for sparse semantic segmentation. The encoder's multi-branch attentive feature fusion module captures both global context and fine detail; the decoder's adaptive feature selection module re-weights feature maps, improving generalization across environmental contexts. The network is built on the memory-efficient sparse convolutions of the Minkowski Engine framework, which restrict computation to occupied voxels and thus handle highly sparse data efficiently. A minimal sketch of the two modules follows.
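The paper's exact layer configurations are not reproduced here; the following dense-tensor sketch captures the two ideas in spirit, with per-location attention over parallel encoder branches and SE-style channel gating as one plausible reading of adaptive feature selection. Class names, shapes, and the reduction factor are assumptions, and in the actual network these modules would operate on sparse tensors.

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Fuse K parallel branches with learned per-branch attention weights."""
    def __init__(self, channels: int, num_branches: int):
        super().__init__()
        self.score = nn.Linear(channels * num_branches, num_branches)

    def forward(self, branches: list) -> torch.Tensor:
        # branches: K tensors of shape (N, C), one row per occupied voxel.
        stacked = torch.stack(branches, dim=1)             # (N, K, C)
        weights = self.score(torch.cat(branches, dim=-1))  # (N, K)
        weights = weights.softmax(dim=-1).unsqueeze(-1)    # (N, K, 1)
        return (weights * stacked).sum(dim=1)              # (N, C)

class AdaptiveFeatureSelection(nn.Module):
    """Gate each channel by a global scene descriptor (SE-style)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = x.mean(dim=0, keepdim=True)  # (1, C) global average
        return x * self.gate(ctx)          # broadcast re-weighting

# Toy usage: three branch outputs for 1000 voxels with 64 channels each.
fuse = AttentiveFusion(channels=64, num_branches=3)
select = AdaptiveFeatureSelection(channels=64)
feats = [torch.randn(1000, 64) for _ in range(3)]
out = select(fuse(feats))  # (1000, 64)
```

The softmax over branches lets each voxel decide which receptive field dominates, which is one simple way an attention mechanism can retain fine detail without discarding global context.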
Experimental Results
In experiments on the SemanticKITTI benchmark, \method~outperforms existing state-of-the-art methods, reaching a mean Intersection over Union (mIoU) of 69.7% and surpassing competitors such as SPVNAS and SalsaNext in key classes like bicycles, motorcycles, and pedestrians. This improvement is attributed primarily to the network's enhanced feature extraction and fusion, together with its ability to preserve rich contextual information across scales.
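For reference, the headline metric is the standard per-class Intersection over Union averaged over classes; the function below is the usual definition, not code from the paper. SemanticKITTI's single-scan benchmark evaluates 19 classes.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 19) -> float:
    """mIoU: per-class IoU = TP / (TP + FP + FN), averaged over classes."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return float(np.mean(ious))
```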
Implications and Future Research Directions
The implications of this work are both practical and theoretical. Practically, incorporating such efficient semantic segmentation models could lead to safer and more reliable autonomous navigation systems, capable of accurately interpreting and reacting to complex road scenarios in real time. Theoretically, the attentive feature fusion mechanism offers insight into the design of more robust network architectures, particularly for tasks involving large-scale, unstructured data.
Future work might integrate this methodology with temporal data for dynamic scene understanding, or extend the model beyond semantic segmentation to instance-level understanding, further broadening autonomous systems' interpretative and operational capabilities. Overall, the approach contributes meaningfully to the development of intelligent perception systems and offers a foundation for subsequent innovations in this domain.