Deep Multi-modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges (1902.07830v4)

Published 21 Feb 2019 in cs.RO

Abstract: Recent advancements in perception for autonomous driving are driven by deep learning. In order to achieve robust and accurate scene understanding, autonomous vehicles are usually equipped with different sensors (e.g. cameras, LiDARs, Radars), and multiple sensing modalities can be fused to exploit their complementary properties. In this context, many methods have been proposed for deep multi-modal perception problems. However, there is no general guideline for network architecture design, and questions of "what to fuse", "when to fuse", and "how to fuse" remain open. This review paper attempts to systematically summarize methodologies and discuss challenges for deep multi-modal object detection and semantic segmentation in autonomous driving. To this end, we first provide an overview of on-board sensors on test vehicles, open datasets, and background information for object detection and semantic segmentation in autonomous driving research. We then summarize the fusion methodologies and discuss challenges and open questions. In the appendix, we provide tables that summarize topics and methods. We also provide an interactive online platform to navigate each reference: https://boschresearch.github.io/multimodalperception/.

Summary

The paper introduces current datasets and innovative fusion methods that combine camera, LiDAR, Radar, and other sensor data for improved autonomous driving perception.
It outlines key challenges such as limited data diversity, optimal fusion strategy selection, and effective uncertainty estimation in sensor data.
The research emphasizes the need for systematic approaches and future advances to achieve robust, real-time object detection and semantic segmentation in autonomous vehicles.

The paper, "Deep Multi-modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges" by Di Feng et al., systematically reviews deep multi-modal perception for autonomous driving. It gives an overview of on-board sensors and open datasets, organizes fusion methods around three questions (what to fuse, how to fuse, and when to fuse), and discusses the open challenges in the field.

Background and Overview

Multi-modal perception for autonomous driving combines sensors such as cameras, LiDARs, Radars, and GPS. The motivation for using multiple sensors is that the data they provide are complementary. For instance, cameras capture rich texture information but struggle in low-light conditions, whereas LiDARs provide accurate depth information that is robust to lighting changes but yield sparse point clouds without fine texture. Autonomous vehicles (AVs) need to understand their surroundings accurately, robustly, and in real time to operate safely and reliably in diverse and complex driving environments.

Datasets

The robustness and accuracy of deep learning algorithms largely depend on the availability and diversity of training datasets. The paper acknowledges the vast amount of data required for training such systems and the challenges in obtaining high-quality, diverse labeled datasets. Several datasets are discussed, including:

KITTI: Widely used but limited in size and scope.
nuScenes: Provides comprehensive data with cameras, LiDARs, and Radars.
KAIST: Combines visual and thermal images with LiDAR data.
Waymo Open Dataset: Offers extensive annotated data for robust training.

These datasets are evaluated based on their sensor modalities, geographic diversity, the variety of recorded scenes, and labeling completeness. Data augmentation through simulation is also highlighted as a means to address these limitations, emphasizing the importance of generating diverse driving scenarios using virtual datasets.

Methods

What to Fuse

This question concerns how each sensing modality is represented and processed before fusion. LiDAR point clouds can be represented as 3D voxels or projected onto 2D feature maps in the bird’s eye view (BEV) or in spherical coordinates. Camera images, predominantly in RGB format, can also be converted into other representations, for example depth maps from monocular depth estimation. The paper assesses how these representations affect fusion techniques and, consequently, the performance of multi-modal perception systems.
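As an illustration of the BEV representation, the following is a minimal sketch, not taken from the paper, that projects a LiDAR point cloud onto a 2D grid and stores the maximum point height and an occupancy flag per cell. The function name lidar_to_bev, the grid ranges, and the 0.5 m cell size are assumptions chosen for the example.

```python
import numpy as np

def lidar_to_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Project (N, 3) LiDAR points (x, y, z) onto a BEV grid.

    Returns a max-height map and a binary occupancy map.
    Ranges and cell size are illustrative assumptions.
    """
    points = np.asarray(points, dtype=np.float64)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    # Keep only points that fall inside the chosen ground-plane window.
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z = x[keep], y[keep], z[keep]

    n_rows = int(round((x_range[1] - x_range[0]) / cell))
    n_cols = int(round((y_range[1] - y_range[0]) / cell))

    # Discretize metric coordinates into grid indices (clipped against float round-off).
    rows = np.clip(((x - x_range[0]) / cell).astype(int), 0, n_rows - 1)
    cols = np.clip(((y - y_range[0]) / cell).astype(int), 0, n_cols - 1)

    # Unbuffered maximum so several points in one cell are all considered.
    height = np.full((n_rows, n_cols), -np.inf)
    np.maximum.at(height, (rows, cols), z)

    occupancy = np.isfinite(height)
    height[~occupancy] = 0.0  # empty cells get a neutral height
    return height, occupancy.astype(np.float32)

# Four toy points; the third lies outside the window and is dropped.
cloud = [[1.0, 0.0, 0.5], [1.2, 0.1, 1.5], [100.0, 0.0, 1.0], [10.0, -5.0, -1.0]]
height_map, occupancy_map = lidar_to_bev(cloud)
```

The resulting 2D maps can be fed to an ordinary image-style CNN, which is one reason BEV projections are convenient when LiDAR features have to be combined with camera features.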

How to Fuse

Fusion operations are crucial in combining sensor data. The paper categorizes fusion techniques into:

Addition or Average Mean: Simple element-wise operations.
Concatenation: Stacking feature maps along the depth dimension.
Ensemble: Combining outputs from different domain-specific networks.
Mixture of Experts (MoE): Weighted averaging based on the informativeness of each modality.
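The four operations above can be sketched on toy feature maps as follows. This is a hedged illustration rather than the paper's implementation: feature maps are assumed to have shape (channels, height, width), the gating logits for the Mixture of Experts are passed in directly instead of being predicted by a gating network, and the ensemble is reduced to averaging per-modality class probabilities. All function names are hypothetical.

```python
import numpy as np

def fuse_add(a, b):
    """Element-wise addition of two feature maps with identical shape."""
    return a + b

def fuse_mean(a, b):
    """Element-wise average of two feature maps with identical shape."""
    return (a + b) / 2.0

def fuse_concat(a, b):
    """Stack feature maps along the channel (depth) axis, assumed to be axis 0."""
    return np.concatenate([a, b], axis=0)

def fuse_ensemble(prob_list):
    """Average class probabilities produced by separate modality-specific networks."""
    return np.mean(np.stack(prob_list), axis=0)

def fuse_moe(features, gate_logits):
    """Weighted average of modality features.

    The weights come from a softmax over gate_logits, which a gating
    network would normally predict (here they are given, as an assumption).
    """
    logits = np.asarray(gate_logits, dtype=np.float64)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    fused = np.tensordot(weights, np.stack(features), axes=1)
    return fused, weights

# Toy camera and LiDAR feature maps of shape (C=2, H=3, W=3).
cam_feat = np.ones((2, 3, 3))
lidar_feat = 3.0 * np.ones((2, 3, 3))

added = fuse_add(cam_feat, lidar_feat)
averaged = fuse_mean(cam_feat, lidar_feat)
stacked = fuse_concat(cam_feat, lidar_feat)
moe_out, moe_weights = fuse_moe([cam_feat, lidar_feat], gate_logits=[0.0, 0.0])
ensembled = fuse_ensemble([np.array([0.8, 0.2]), np.array([0.4, 0.6])])
```

Addition and averaging require matching shapes and keep the channel count fixed, whereas concatenation doubles the channels and leaves the weighting to later layers. The Mixture of Experts makes the weighting explicit, so a modality can be down-weighted when its gate logit is low.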

When to Fuse

The stage at which data from different sensors are fused within a neural network plays a vital role. The paper divides fusion schemes into:

Early Fusion: At the input layer, allowing the network to learn joint features from raw data.
Late Fusion: At the decision layer, combining outputs of modality-specific networks.
Middle Fusion: At intermediate layers, allowing a hierarchical combination of features.

Different fusion schemes are evaluated in terms of computational efficiency, flexibility, and robustness.
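The difference between the three schemes is mainly where the modality branches meet. The sketch below is a toy illustration under stated assumptions: each modality is a flat feature vector, layers are fixed random linear maps with ReLU, and the dimensions and names (CAM_DIM, W_CAM, early_fusion, and so on) are hypothetical. It is not a trained model and not the architecture of any method reviewed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
CAM_DIM, LIDAR_DIM, HIDDEN, CLASSES = 8, 6, 16, 3

def weights(n_in, n_out):
    """Random weight matrix standing in for a trained layer."""
    return rng.standard_normal((n_in, n_out)) * 0.1

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

W_EARLY_1, W_EARLY_2 = weights(CAM_DIM + LIDAR_DIM, HIDDEN), weights(HIDDEN, CLASSES)
W_CAM, W_LIDAR = weights(CAM_DIM, HIDDEN), weights(LIDAR_DIM, HIDDEN)
W_MID_HEAD = weights(2 * HIDDEN, CLASSES)
W_CAM_HEAD, W_LIDAR_HEAD = weights(HIDDEN, CLASSES), weights(HIDDEN, CLASSES)

def early_fusion(cam, lidar):
    """Concatenate raw inputs, then run one shared network."""
    x = np.concatenate([cam, lidar], axis=-1)
    return softmax(relu(x @ W_EARLY_1) @ W_EARLY_2)

def middle_fusion(cam, lidar):
    """Encode each modality separately, fuse the features, then classify."""
    h = np.concatenate([relu(cam @ W_CAM), relu(lidar @ W_LIDAR)], axis=-1)
    return softmax(h @ W_MID_HEAD)

def late_fusion(cam, lidar):
    """Run a full network per modality and average the decisions."""
    p_cam = softmax(relu(cam @ W_CAM) @ W_CAM_HEAD)
    p_lidar = softmax(relu(lidar @ W_LIDAR) @ W_LIDAR_HEAD)
    return 0.5 * (p_cam + p_lidar)

# A batch of four toy samples per modality.
cam_batch = rng.standard_normal((4, CAM_DIM))
lidar_batch = rng.standard_normal((4, LIDAR_DIM))
outputs = {f.__name__: f(cam_batch, lidar_batch)
           for f in (early_fusion, middle_fusion, late_fusion)}
```

In this sketch, late fusion keeps the two branches fully independent, so one branch could be retrained or replaced without touching the other, while early fusion shares every layer and therefore depends on both inputs being present and aligned.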

Challenges and Open Questions

Data Preparation

The limited size and diversity of training datasets pose significant challenges. Ensuring comprehensive coverage of different driving scenarios, weather conditions, and object classes is essential. Labeling efficiency through active learning, transfer learning, and semi-supervised techniques is also recognized as a crucial area for improvement.

Fusing Radars and Other Modalities

The fusion of data from under-utilized sensors such as Radars and ultrasonic sensors is an open field of research. Integrating these modalities promises enhanced robustness, especially in adverse weather conditions where cameras and LiDARs might struggle.

Uncertainty Estimation

Effective uncertainty quantification is pivotal for safe autonomous operations. The paper underlines the need for frameworks to propagate sensor uncertainties through to decision-making modules. Bayesian Neural Networks (BNNs) are suggested as a viable approach for uncertainty estimation in multi-modal perception systems.
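One common way to read uncertainty from a Bayesian-style model is to draw several stochastic predictions for the same input and compare them. The sketch below assumes such samples are already available as class-probability vectors (for example from sampled BNN weights; how they are produced is outside this example) and splits the predictive entropy into an average per-sample entropy and the remaining mutual information, which are commonly interpreted as data and model uncertainty respectively. The function names are hypothetical and the decomposition is a standard technique, not one specified by this summary.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats; probabilities are clipped inside the log for stability."""
    p = np.asarray(p, dtype=np.float64)
    return -np.sum(p * np.log(np.clip(p, eps, 1.0)), axis=axis)

def decompose_uncertainty(sampled_probs):
    """Split predictive uncertainty for one input.

    sampled_probs: (T, C) array, T stochastic predictions over C classes.
    """
    sampled_probs = np.asarray(sampled_probs, dtype=np.float64)
    mean_probs = sampled_probs.mean(axis=0)
    total = entropy(mean_probs)                 # entropy of the averaged prediction
    expected = entropy(sampled_probs).mean()    # average entropy of each sample
    return {
        "total": float(total),
        "aleatoric": float(expected),
        "epistemic": float(total - expected),   # mutual information
    }

# Samples that agree on an ambiguous prediction: uncertainty is in the data.
agreeing = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])

# Confident samples that contradict each other: uncertainty is in the model.
disagreeing = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Both toy cases have the same total entropy, but only the second has a large mutual-information term, which is the kind of signal a downstream decision-making module could use to distrust a perception output.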

Best Practices for Fusion Strategies

Designing optimal fusion architectures is often empirical. The paper calls for more systematic approaches, possibly through neural architecture search and visual analytics tools, to discover the most effective fusion strategies.

Future Directions

The future of multi-modal perception lies in several promising directions:

Continual Learning: Developing methods for lifelong learning to continuously update models with new data.
Generative Models for Diverse Data: Utilizing approaches like GANs to generate varied and realistic training datasets.
Comprehensive Evaluation Metrics: Creating metrics that go beyond accuracy to evaluate robustness and uncertainty effectively.

Conclusion

This review highlights significant progress and continuing challenges in deep multi-modal perception for autonomous driving. It serves as a comprehensive guide for researchers and practitioners looking to leverage multiple sensor modalities to enhance scene understanding in autonomous vehicles. Future advancements in dataset diversity, fusion methodologies, and robust evaluation frameworks will be pivotal for realizing the full potential of autonomous driving technologies.

Deep Multi-modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges | Di Feng, Christian Haase-Schuetz, Lars Rosenbaum, Heinz Hertlein, Claudius Glaeser, Fabian Timm, Werner Wiesbeck and Klaus Dietmayer
Robert Bosch GmbH in cooperation with Ulm University and Karlsruhe Institute of Technology
* Contributed equally