BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

Published 26 May 2022 in cs.CV | (2205.13542v3)

Abstract: Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost. Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.

Summary

The paper demonstrates that BEVFusion achieves state-of-the-art 3D detection and BEV segmentation by unifying LiDAR geometry with camera semantics in a shared BEV space.
It uses modality-specific encoders and a fully convolutional BEV encoder to compensate for spatial misalignment between the modalities, and an optimized BEV pooling operator to remove the efficiency bottleneck in view transformation.
On the nuScenes benchmark it reports 1.3% higher mAP and NDS for 3D object detection and 13.6% higher mIoU for BEV map segmentation, at 1.9x lower computation cost.

The paper presents BEVFusion, a multi-sensor fusion framework for autonomous driving that unifies features from multiple modalities in a shared bird's-eye view (BEV) representation. This matters for tasks such as 3D object detection and BEV map segmentation, which depend on both geometric and semantic information.

Key Contributions and Methodology

BEVFusion addresses the shortcomings of point-level fusion by adopting BEV space as the unified representation, which preserves both the geometric structure from LiDAR and the semantic density of camera features. The resulting framework is task-agnostic and supports different 3D perception tasks with almost no architectural changes. The main efficiency bottleneck is the camera-to-BEV view transformation; the paper's optimized BEV pooling reduces its latency by more than 40x.
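To make the pooling step concrete, here is a minimal pure-Python sketch of BEV pooling as a scatter-add: each camera feature point is assigned to a BEV grid cell, and the features that land in the same cell are summed. It is not the optimized kernel from the paper. The function names, the `(x, y)` point format, and the split into a precomputed index step and a pooling step are all illustrative assumptions. The split is meant to show that, with fixed sensor calibration, the point-to-cell mapping could be computed once and reused.

```python
from typing import List, Optional, Sequence, Tuple


def precompute_cell_indices(
    points_xy: Sequence[Tuple[float, float]],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    cell_size: float,
) -> Tuple[List[Optional[int]], int]:
    """Map each (x, y) point to a flat BEV cell index (None if off-grid).

    Hypothetical helper: returns the per-point indices and the cell count.
    """
    nx = int(round((x_range[1] - x_range[0]) / cell_size))
    ny = int(round((y_range[1] - y_range[0]) / cell_size))
    indices: List[Optional[int]] = []
    for x, y in points_xy:
        ix = int((x - x_range[0]) // cell_size)
        iy = int((y - y_range[0]) // cell_size)
        inside = 0 <= ix < nx and 0 <= iy < ny
        indices.append(iy * nx + ix if inside else None)
    return indices, nx * ny


def bev_pool(
    features: Sequence[Sequence[float]],
    cell_indices: Sequence[Optional[int]],
    num_cells: int,
    channels: int,
) -> List[List[float]]:
    """Sum the feature vectors of all points that fall in the same cell."""
    grid = [[0.0] * channels for _ in range(num_cells)]
    for feat, cell in zip(features, cell_indices):
        if cell is None:  # point lies outside the BEV grid
            continue
        for c in range(channels):
            grid[cell][c] += feat[c]
    return grid


# Usage: four points on a 2 x 2 grid with 1 m cells; the last is off-grid.
points = [(0.5, 0.5), (0.6, 0.4), (1.5, 0.5), (5.0, 5.0)]
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
idx, n_cells = precompute_cell_indices(points, (0.0, 2.0), (0.0, 2.0), 1.0)
pooled = bev_pool(feats, idx, n_cells, channels=2)
```

In this sketch the index computation depends only on geometry, so it can be cached across frames, while `bev_pool` runs per frame on fresh features.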

The paper examines how converting features to BEV keeps LiDAR geometry intact while avoiding the semantic loss that occurs in point-level fusion, where projecting camera features onto the sparse LiDAR point cloud discards most of them. The architecture uses modality-specific encoders to extract features from each sensor, followed by a fully convolutional BEV encoder that compensates for residual spatial misalignment between the fused camera and LiDAR features. Task-specific heads are then attached to support distinct tasks such as 3D object detection and BEV map segmentation.
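The data flow described above can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the paper's network. It assumes the two modalities are fused by channel-wise concatenation of BEV grids with the same shape. A fixed 3 x 3 averaging filter stands in for the learned convolutional BEV encoder, only to show how a local neighbourhood operation can absorb small spatial offsets. The names `fuse_bev` and `average3x3` are hypothetical.

```python
from typing import List

# A BEV grid indexed as grid[row][col][channel].
Grid = List[List[List[float]]]


def fuse_bev(camera_bev: Grid, lidar_bev: Grid) -> Grid:
    """Concatenate camera and LiDAR features cell by cell along channels."""
    same_rows = len(camera_bev) == len(lidar_bev)
    same_cols = same_rows and all(
        len(c_row) == len(l_row) for c_row, l_row in zip(camera_bev, lidar_bev)
    )
    if not same_cols:
        raise ValueError("camera and LiDAR BEV grids must share a shape")
    return [
        [c_cell + l_cell for c_cell, l_cell in zip(c_row, l_row)]
        for c_row, l_row in zip(camera_bev, lidar_bev)
    ]


def average3x3(grid: Grid) -> Grid:
    """Fixed 3 x 3 mean filter (stand-in for a learned convolution).

    Border cells average over the neighbours that exist.
    """
    rows, cols, channels = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * channels for _ in range(cols)] for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            for ch in range(channels):
                out[r][c][ch] = sum(n[ch] for n in neighbours) / len(neighbours)
    return out


# Usage: a 2 x 2 grid with 1 camera channel and 2 LiDAR channels.
camera = [[[1.0], [0.0]], [[0.0], [0.0]]]
lidar = [[[0.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 4.0]]]
fused = fuse_bev(camera, lidar)   # 3 channels per cell
encoded = average3x3(fused)       # would feed the task-specific heads
```

In a real model the averaging filter would be replaced by learned convolutional layers, and each task head would consume the encoded grid.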

Empirical Validation

BEVFusion establishes a new state of the art on the nuScenes benchmark, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation than existing fusion methods, at 1.9x lower computation cost. The gains hold across varying conditions: the summary highlights robustness on small and distant objects and in challenging weather and lighting.
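For readers unfamiliar with the segmentation metric, the following is a minimal sketch of mean intersection-over-union (mIoU) over per-class binary BEV masks. The class names, the flattened mask format, and the choice to skip classes whose union is empty are assumptions made for illustration; the benchmark's exact evaluation protocol may differ.

```python
from typing import Dict, Optional, Sequence


def iou(pred: Sequence[int], target: Sequence[int]) -> Optional[float]:
    """IoU of two flattened binary masks; None if both masks are empty."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of cells")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return intersection / union if union else None


def mean_iou(
    preds: Dict[str, Sequence[int]], targets: Dict[str, Sequence[int]]
) -> float:
    """Average per-class IoU, skipping classes with an empty union."""
    scores = [iou(preds[name], targets[name]) for name in targets]
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no class has any positive cell")
    return sum(valid) / len(valid)


# Usage: two hypothetical map classes on a 4-cell BEV grid.
preds = {"road": [1, 1, 0, 0], "lane": [0, 1, 1, 0]}
targets = {"road": [1, 0, 0, 0], "lane": [0, 1, 1, 0]}
score = mean_iou(preds, targets)
```

Here the "road" mask overlaps the target in one of two occupied cells and the "lane" mask matches exactly, so the mean sits between the two per-class scores.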

Comparisons against existing methods show that BEVFusion is both more accurate and more efficient, particularly when the model is trained end to end. Its contribution is not only higher scores across metrics but also a substantial reduction in computation and latency, which makes it practical for real-world deployment.

Implications and Future Research

BEVFusion has direct implications for autonomous vehicle perception, where it improves both efficiency and accuracy. The work leaves room for further research on more precise depth estimation and on multi-task learning, where joint training still shows a performance gap. Integrating additional sensor types such as radar could extend BEVFusion's applicability to a wider range of perception tasks.

Overall, BEVFusion's task-agnostic design and efficient operation make it a strong baseline for subsequent work on sensor fusion. It challenges the perception community to reconsider the entrenched point-level fusion paradigm in favor of fusion in a unified representation space.

Conclusion

BEVFusion is a notable advance in multi-sensor fusion for autonomous driving. By fusing camera and LiDAR features in a shared BEV space, it supports multiple perception tasks within one framework while lowering computation cost, and it provides a foundation for further research on efficient, unified, and robust perception.
