Multi-View 3D Object Detection Network for Autonomous Driving

Published 23 Nov 2016 in cs.CV | (1611.07759v3)

Abstract: This paper aims at high-accuracy 3D object detection in autonomous driving scenario. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point cloud and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's eye view representation of 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state-of-the-art by around 25% and 30% AP on the tasks of 3D localization and 3D detection. In addition, for 2D detection, our approach obtains 10.3% higher AP than the state-of-the-art on the hard data among the LIDAR-based methods.


Authors (5)

Citations (2,609)


Summary

The paper introduces a novel multi-view fusion method that combines LiDAR point clouds and RGB images to produce accurate 3D bounding box proposals.
It utilizes a 3D Proposal Network and Region-based Fusion Network to integrate bird's eye view, front view, and image features efficiently.
Experiments on the KITTI benchmark reveal significant recall and precision improvements, boosting both 3D localization and detection performance.


The paper "Multi-View 3D Object Detection Network for Autonomous Driving" by Xiaozhi Chen et al. presents MV3D, a sensory-fusion framework for 3D object detection in autonomous driving scenarios. It takes both LIDAR point clouds and RGB images as input and predicts oriented 3D bounding boxes. The network consists of two subnetworks: a 3D Proposal Network and a Region-based Fusion Network.

Core Contributions

Multi-View Representation:
- The sparse 3D point cloud is encoded with a compact multi-view representation made up of bird's eye view and front view maps.
- The bird's eye view is encoded as height, intensity, and density maps, which turns the point cloud into a 2D grid that standard convolutional networks can process.
- The front view encodes height, distance, and intensity, complementing the bird's eye view.
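
The bird's eye view encoding described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `encode_bev`, the grid ranges, the cell size, the single height channel, and the log-scaled density capped at 1 are all assumptions made for the example.

```python
import math

def encode_bev(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0), cell=1.0,
               density_cap=64):
    """Rasterize (x, y, z, reflectance) points into three 2D grids.

    Returns (height, intensity, density). Grid extents, cell size, and the
    density normalization are illustrative assumptions.
    """
    rows = int((x_range[1] - x_range[0]) / cell)
    cols = int((y_range[1] - y_range[0]) / cell)
    height = [[0.0] * cols for _ in range(rows)]
    intensity = [[0.0] * cols for _ in range(rows)]
    counts = [[0] * cols for _ in range(rows)]

    for x, y, z, r in points:
        # Skip points that fall outside the grid.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        i = int((x - x_range[0]) / cell)
        j = int((y - y_range[0]) / cell)
        # Keep the highest point in each cell, together with its reflectance.
        if counts[i][j] == 0 or z > height[i][j]:
            height[i][j] = z
            intensity[i][j] = r
        counts[i][j] += 1

    # Assumed normalization: log-scaled point count, capped at 1.
    density = [[min(1.0, math.log(n + 1) / math.log(density_cap)) for n in row]
               for row in counts]
    return height, intensity, density
```

Calling `encode_bev(points)` on a list of `(x, y, z, reflectance)` tuples returns three grids of equal shape that could be stacked as input channels. Empty cells keep a value of 0 in every grid.
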
3D Proposal Network:
- Inspired by the structure of Region Proposal Networks (RPN), this subnetwork extracts 3D box proposals efficiently from the bird's eye view representation.
- Because very small objects cover only a few cells of the downsampled bird's eye view feature map, the feature map is upsampled before proposals are generated, giving the proposal stage a higher-resolution input without excessive computational load.
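
A small arithmetic sketch shows why upsampling helps with small objects. The helper `footprint_cells` and the numbers used with it (a 0.1 m grid cell, a downsampling stride of 8, a 2x upsampling factor, a 1.6 m wide object) are assumptions chosen for illustration and are not taken from this summary.

```python
def footprint_cells(size_m, cell_m, stride, upsample=1):
    """Approximate extent, in feature-map cells, of an object size_m metres wide.

    cell_m is the bird's eye view grid resolution in metres, stride is the
    network's total downsampling factor, and upsample is an optional
    upsampling factor applied to the final feature map.
    """
    input_pixels = size_m / cell_m          # extent in the input grid
    return input_pixels * upsample / stride  # extent in the feature map


# Assumed example values: a 1.6 m wide object on a 0.1 m grid, stride 8.
without_upsampling = footprint_cells(1.6, 0.1, stride=8)
with_upsampling = footprint_cells(1.6, 0.1, stride=8, upsample=2)
```

Under these assumed values the object spans about two feature-map cells without upsampling and about four with it, which leaves the proposal stage more spatial detail to work with.
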
Region-based Fusion Network:
- Projects each 3D proposal onto the bird's eye view, front view, and RGB image plane, and extracts region-wise features from each view.
- Introduces a deep fusion scheme that combines the views at intermediate layers, rather than only at the input (early fusion) or only at the output (late fusion), so that the paths of the different views can interact. The paper reports that this outperforms both alternatives.
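
The difference between early, late, and deep fusion can be made concrete with a toy sketch. It assumes an element-wise mean as the join operation and uses `make_layer` as a hypothetical stand-in for a learned layer; feature vectors are plain Python lists. It is meant to show the data flow only, not the network from the paper.

```python
def mean_join(vectors):
    """Element-wise mean of equally sized feature vectors (assumed join op)."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]


def make_layer(weight, bias):
    """Toy stand-in for a learned layer: ReLU(weight * x + bias) per element."""
    return lambda feats: [max(0.0, weight * v + bias) for v in feats]


def early_fusion(views, layers):
    """Join the views once at the input, then run a single stack of layers."""
    feats = mean_join(views)
    for layer in layers:
        feats = layer(feats)
    return feats


def late_fusion(views, branches):
    """Run each view through its own stack and join only at the end."""
    outputs = []
    for feats, layers in zip(views, branches):
        for layer in layers:
            feats = layer(feats)
        outputs.append(feats)
    return mean_join(outputs)


def deep_fusion(views, branches):
    """Join the views, then re-join the branch outputs after every layer.

    At each depth, every branch applies its own layer to the shared
    features, so the branches interact at intermediate layers.
    """
    feats = mean_join(views)
    for depth in range(len(branches[0])):
        feats = mean_join([branch[depth](feats) for branch in branches])
    return feats
```

Here `views` would hold one region-wise feature vector per view (bird's eye view, front view, image) and `branches` one list of layers per view. Only `deep_fusion` mixes information across branches between layers.
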

Experimental Results

The experiments are conducted on the KITTI benchmark, a challenging dataset for autonomous driving perception tasks. The MV3D framework demonstrates considerable improvements over state-of-the-art techniques:

3D Proposal Recall:
- Achieves a recall rate of 99.1% and 91% at IoU thresholds of 0.25 and 0.5 respectively, using only 300 proposals. This notably surpasses the performances of recent 3D proposal generation methods such as 3DOP and Mono3D.
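
Proposal recall at an IoU threshold can be computed along the following lines. This sketch simplifies the setting: boxes are axis-aligned and given as `(x1, y1, z1, x2, y2, z2)`, whereas MV3D predicts oriented boxes, and the function names are hypothetical. A ground-truth box counts as recalled if at least one proposal overlaps it with IoU at or above the threshold.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2)."""
    inter = 1.0
    for k in range(3):
        overlap = min(a[k + 3], b[k + 3]) - max(a[k], b[k])
        if overlap <= 0:
            return 0.0  # boxes do not intersect along this axis
        inter *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def recall_at_iou(gt_boxes, proposals, threshold):
    """Fraction of ground-truth boxes matched by at least one proposal."""
    if not gt_boxes:
        return 0.0
    hits = sum(
        1 for gt in gt_boxes
        if any(iou_3d(gt, p) >= threshold for p in proposals)
    )
    return hits / len(gt_boxes)
```

Recall usually falls as the threshold rises, because a stricter threshold requires tighter boxes. This matches the lower recall reported at IoU 0.5 than at 0.25.
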
3D Localization and Detection:
- For 3D localization, the approach attains around 25% higher Average Precision (AP) than the prior state of the art, with VeloFCN as the LIDAR-based point of comparison.
- For 3D detection, evaluated at IoU thresholds of 0.25 and 0.5, the gain over the state of the art is around 30% AP, and the MV3D network secures above 85% AP in moderate test cases.
2D Detection:
- The approach also improves 2D detection accuracy, achieving 10.3% higher AP than the best previous LIDAR-based method on KITTI's hard setting.

Implications and Future Work

MV3D improves both 3D localization and 3D object detection on KITTI, which supports multi-view sensory fusion as a practical design for 3D perception. Combining LIDAR and RGB data in this way is a step toward autonomous driving systems that perceive their environment precisely and robustly.

Speculative Future Developments

Extended Multi-Modal Fusion:
- Future research might explore the inclusion of additional sensory data such as radar or thermal imaging to further enhance detection capabilities, particularly in adverse weather conditions.
End-to-End Optimization:
- Architectures that jointly optimize detection, tracking, and motion prediction end to end within a unified framework could further streamline autonomous operations.
Deployment and Real-World Scalability:
- Studies concentrating on reducing computational complexity and enhancing real-time execution would be critical for deploying such models in commercial autonomous driving systems.

In conclusion, the MV3D network represents substantial progress in 3D object detection and demonstrates that multi-view sensory fusion can improve the accuracy and reliability of autonomous driving perception systems.
