CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers

Published 5 Jul 2022 in cs.CV | (2207.02202v2)

Abstract: Bird's eye view (BEV) semantic segmentation plays a crucial role in spatial sensing for autonomous driving. Although recent literature has made significant progress on BEV map understanding, they are all based on single-agent camera-based systems. These solutions sometimes have difficulty handling occlusions or detecting distant objects in complex traffic scenes. Vehicle-to-Vehicle (V2V) communication technologies have enabled autonomous vehicles to share sensing information, dramatically improving the perception performance and range compared to single-agent systems. In this paper, we propose CoBEVT, the first generic multi-agent multi-camera perception framework that can cooperatively generate BEV map predictions. To efficiently fuse camera features from multi-view and multi-agent data in an underlying Transformer architecture, we design a fused axial attention module (FAX), which captures sparsely local and global spatial interactions across views and agents. The extensive experiments on the V2V perception dataset, OPV2V, demonstrate that CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation. Moreover, CoBEVT is shown to be generalizable to other tasks, including 1) BEV segmentation with single-agent multi-camera and 2) 3D object detection with multi-agent LiDAR systems, achieving state-of-the-art performance with real-time inference speed. The code is available at https://github.com/DerrickXuNu/CoBEVT.


Authors (6)

Citations (172)


Summary

The paper presents CoBEVT, a framework that fuses multi-agent, multi-camera data with sparse transformers to improve BEV semantic segmentation.
It introduces the fused axial attention (FAX) module, which efficiently captures local and global spatial interactions across camera views and across agents connected by V2V communication.
Experiments on the OPV2V dataset show a 22.7% improvement over a single-agent baseline and a 6.9% gain over prior state-of-the-art methods.

Overview of CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers

The paper "CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers" presents a framework that improves the perception of autonomous vehicles (AVs) through multi-agent cooperation. CoBEVT generates Bird's Eye View (BEV) map predictions by fusing multi-agent, multi-camera data with sparse vision transformers. Vehicle-to-Vehicle (V2V) communication lets vehicles share sensing information, which improves perception performance and range, particularly in complex traffic scenes where single-agent systems struggle with occlusions or distant objects.

The core component of CoBEVT is the fused axial attention (FAX) module, designed to efficiently capture local and global spatial interactions across views and agents. The paper describes the architecture and evaluates it on the V2V perception dataset OPV2V, where CoBEVT achieves state-of-the-art performance in cooperative BEV semantic segmentation. The model also generalizes to related tasks: single-agent multi-camera BEV segmentation and multi-agent LiDAR-based 3D object detection.

Methodological Insights

CoBEVT uses sparse transformers to process sensor data cooperatively. The architecture has two key components: SinBEVT and FuseBEVT. SinBEVT computes a BEV feature map for each agent from that agent's multi-camera inputs. The agents then share these features over V2V communication, and FuseBEVT fuses them into a single BEV map.
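The two-stage data flow can be illustrated with a toy sketch. It is not the paper's implementation: `single_agent_bev` and `fuse_bev` are hypothetical stand-ins, the per-cell maximum replaces FuseBEVT's attention-based fusion, and the grids are assumed to be already aligned to the ego vehicle's frame.

```python
from typing import List, Optional

# A BEV grid of per-cell scores; None marks a cell the agent cannot observe.
Grid = List[List[Optional[float]]]


def single_agent_bev(visible: Grid) -> Grid:
    """Hypothetical stand-in for SinBEVT: returns a copy of the agent's own BEV scores."""
    return [row[:] for row in visible]


def fuse_bev(agent_grids: List[Grid]) -> List[List[float]]:
    """Hypothetical stand-in for FuseBEVT: per-cell maximum over the agents that see the cell.

    Assumes all grids share the same shape and coordinate frame.
    """
    rows, cols = len(agent_grids[0]), len(agent_grids[0][0])
    fused = []
    for r in range(rows):
        fused_row = []
        for c in range(cols):
            seen = [g[r][c] for g in agent_grids if g[r][c] is not None]
            # Cells that no agent observes default to a score of 0.0.
            fused_row.append(max(seen) if seen else 0.0)
        fused.append(fused_row)
    return fused


# The ego vehicle cannot see the right column; a second agent can.
ego = single_agent_bev([[0.9, None], [0.1, None]])
helper = single_agent_bev([[None, 0.8], [0.2, 0.0]])
fused_map = fuse_bev([ego, helper])
print(fused_map)
```

In this toy case the shared features fill the cells that are occluded for the ego vehicle, which is the benefit the cooperative setup is meant to provide.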

A notable contribution is the FAX module, which is used for both self-attention and cross-attention. Because multi-agent, multi-view feature maps have large spatial dimensions, dense attention over all tokens would be costly. FAX attention is sparse instead: it combines local windowed attention with sparse global interactions, which keeps computation efficient while still capturing the long-range context needed to understand road layout and dynamic traffic.
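To make the local/global split concrete, here is a minimal sketch of how tokens on a 2D feature map could be grouped for the two kinds of attention. It assumes a non-overlapping window partition for local attention and a strided grid partition for sparse global attention. That scheme, the function names, and the sizes are illustrative assumptions, not the paper's exact design, and the real module also spans the view and agent dimensions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Token = Tuple[int, int]  # (row, column) position on the feature map


def window_groups(height: int, width: int, window: int) -> Dict[Tuple[int, int], List[Token]]:
    """Local attention: tokens in the same window x window block attend to each other."""
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i // window, j // window)].append((i, j))
    return dict(groups)


def grid_groups(height: int, width: int, grid: int) -> Dict[Tuple[int, int], List[Token]]:
    """Sparse global attention: each group holds grid x grid tokens spread over the whole map.

    Assumes height and width are divisible by grid.
    """
    stride_h, stride_w = height // grid, width // grid
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i % stride_h, j % stride_w)].append((i, j))
    return dict(groups)


def attention_pairs(groups: Dict[Tuple[int, int], List[Token]]) -> int:
    """Number of query-key pairs when attention is computed within each group only."""
    return sum(len(tokens) ** 2 for tokens in groups.values())


H = W = 8
local = window_groups(H, W, window=4)
sparse_global = grid_groups(H, W, grid=4)
full_pairs = (H * W) ** 2  # dense attention over all 64 tokens
sparse_pairs = attention_pairs(local) + attention_pairs(sparse_global)
print(full_pairs, sparse_pairs)
```

Each local group covers one contiguous block, while each global group samples the whole map at a fixed stride, so stacking the two lets information travel across the map at a fraction of the dense pairwise cost.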

Numerical Results and Performance

The experiments show notable improvements over prior models. On the OPV2V dataset, CoBEVT gains 22.7% over a single-agent baseline and 6.9% over the leading contemporary models. These improvements are attributed to CoBEVT's ability to use cooperative sensor data and to transform and fuse it efficiently into a unified BEV representation.

Theoretical and Practical Implications

Practically, CoBEVT offers a scalable camera-based solution for autonomous driving that does not rely on costly LiDAR sensors. Theoretically, it shows how sparse transformers can be applied to multi-agent perception. The design of FAX attention may also extend to other domains that require efficient processing of high-dimensional sensor data.

Future Directions

Future work could account for real-world imperfections in V2V communication, such as synchronization issues and pose inaccuracies. Improving robustness and accuracy under diverse environmental conditions and on real-world datasets remains an open research direction. Exploring CoBEVT's adaptability to other autonomous driving tasks and sensor modalities could further demonstrate the framework's versatility and carry its principles into broader intelligent transportation systems.
