TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving (2205.15997v1)

Published 31 May 2022 in cs.CV, cs.AI, cs.LG, and cs.RO

Abstract: How should we integrate representations from complementary sensors for autonomous driving? Geometry-based fusion has shown promise for perception (e.g. object detection, motion forecasting). However, in the context of end-to-end driving, we find that imitation learning based on existing sensor fusion methods underperforms in complex driving scenarios with a high density of dynamic agents. Therefore, we propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention. Our approach uses transformer modules at multiple resolutions to fuse perspective view and bird's eye view feature maps. We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as the official leaderboard of the CARLA urban driving simulator. At the time of submission, TransFuser outperforms all prior work on the CARLA leaderboard in terms of driving score by a large margin. Compared to geometry-based fusion, TransFuser reduces the average collisions per kilometer by 48%.

Citations (232)

Summary

  • The paper introduces a transformer-based fusion model that integrates LiDAR and camera data for enhanced perception in autonomous driving.
  • The model addresses the global contextual reasoning limitations of convolutional fusion, reducing average collisions per kilometer by 48% on the CARLA simulator.
  • It establishes a new benchmark for evaluating complex driving scenarios, paving the way for future research in multimodal sensor fusion.

Autonomous Driving with Transformer-Based Sensor Fusion: Insights from the TransFuser Approach

The paper "TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving" by Kashyap Chitta et al. introduces TransFuser, a novel approach leveraging transformer-based sensor fusion for autonomous driving. The main innovation lies in using transformers to integrate image and LiDAR data, addressing significant challenges seen in complex driving scenarios with high dynamic agent densities.

In autonomous driving, integrating complementary sensor information such as LiDAR and camera data can enhance the system's perceptual understanding. LiDAR offers precise 3D spatial data, whereas cameras provide texture-rich information. Traditional methods have largely relied on geometry-based fusion for perception tasks like object detection and motion forecasting. Such methods often fall short in end-to-end driving contexts, especially under complex conditions. The authors identify these limitations and propose TransFuser, which employs self-attention mechanisms from transformers to fuse sensor representations at multiple resolutions. This approach provides a more coherent understanding of the driving scene by capturing both geometric and semantic nuances across modalities.
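To make the fusion idea concrete, below is a minimal PyTorch sketch of self-attention over concatenated image and LiDAR tokens. The class name, tensor shapes, and single-layer encoder are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of self-attention fusion over image and LiDAR features.
# Shapes, layer sizes, and the single-layer design are assumptions for
# illustration, not the authors' exact configuration.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, img_feat: torch.Tensor, lidar_feat: torch.Tensor):
        # img_feat:   (B, C, Hi, Wi) perspective-view feature map
        # lidar_feat: (B, C, Hl, Wl) bird's-eye-view feature map
        B, C, Hi, Wi = img_feat.shape
        _, _, Hl, Wl = lidar_feat.shape
        # Flatten both maps into token sequences and concatenate them, so
        # self-attention can relate any image token to any LiDAR token.
        tokens = torch.cat([img_feat.flatten(2).transpose(1, 2),     # (B, Hi*Wi, C)
                            lidar_feat.flatten(2).transpose(1, 2)],  # (B, Hl*Wl, C)
                           dim=1)
        fused = self.encoder(tokens)
        # Split the sequence back and restore each branch's spatial layout.
        img_out = fused[:, :Hi * Wi].transpose(1, 2).reshape(B, C, Hi, Wi)
        lidar_out = fused[:, Hi * Wi:].transpose(1, 2).reshape(B, C, Hl, Wl)
        return img_out, lidar_out
```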

The paper details the architecture of TransFuser, emphasizing its ability to address the global contextual reasoning limitations of traditional convolutional networks. The network uses transformer modules designed to handle bidirectional attention across sensor modalities, effectively capturing interactions between dynamic agents and infrastructure elements, such as traffic lights, in driving environments. Incorporating transformer modules at multiple feature extraction stages allows the network to maintain context awareness at multiple resolutions, a significant advantage over previous fusion methods.
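The multi-resolution aspect can be sketched as fusion blocks interleaved between the convolutional stages of the two branches, reusing the AttentionFusion module from the previous snippet. Pooling each map to a small grid before attention and adding the upsampled result back into each branch is one plausible way to keep attention affordable; the stage structure, pooled grid size, and residual addition here are assumptions for illustration.

```python
# Sketch of fusion interleaved across CNN stages, reusing AttentionFusion
# from the previous snippet. The backbone stages, pooled grid size, and
# residual addition are illustrative assumptions.
import torch.nn.functional as F

class TwoBranchEncoder(nn.Module):
    def __init__(self, img_stages, lidar_stages, dims):
        super().__init__()
        self.img_stages = nn.ModuleList(img_stages)      # e.g. image CNN stages
        self.lidar_stages = nn.ModuleList(lidar_stages)  # BEV CNN stages
        self.fusers = nn.ModuleList([AttentionFusion(d) for d in dims])

    def forward(self, img, lidar):
        for img_stage, lidar_stage, fuse in zip(self.img_stages,
                                                self.lidar_stages,
                                                self.fusers):
            img, lidar = img_stage(img), lidar_stage(lidar)
            # Fuse at a coarse 8x8 resolution so attention runs over a
            # small token count, then upsample the fused context and add
            # it back into each branch.
            f_img, f_lidar = fuse(F.adaptive_avg_pool2d(img, 8),
                                  F.adaptive_avg_pool2d(lidar, 8))
            img = img + F.interpolate(f_img, size=img.shape[-2:],
                                      mode="bilinear", align_corners=False)
            lidar = lidar + F.interpolate(f_lidar, size=lidar.shape[-2:],
                                          mode="bilinear", align_corners=False)
        return img, lidar
```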

An evaluation in the CARLA urban driving simulator demonstrates TransFuser's superior performance. The model achieves notable improvements over existing methods, evidenced by a significantly higher driving score on the CARLA leaderboard. TransFuser also reduces average collisions per kilometer by 48% compared to geometry-based fusion, indicating its effectiveness in navigating complex, dynamic scenarios. The authors support this with empirical analysis on Longest6, a challenging new benchmark they propose that involves long routes and dense traffic conditions.
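For context on the headline metric: the CARLA leaderboard's driving score weights each route's completion by a multiplicative infraction penalty and averages over routes. A hedged sketch follows; the example penalty coefficients are placeholders, not the official leaderboard values.

```python
# Hedged sketch of a CARLA-style driving score: route completion in
# [0, 1] scaled by a multiplicative infraction penalty, averaged over
# routes. The example coefficients are placeholders, not official values.
def driving_score(routes):
    """routes: list of (route_completion, [penalty_coefficients]) pairs."""
    scores = []
    for completion, penalties in routes:
        infraction_penalty = 1.0
        for p in penalties:
            infraction_penalty *= p  # each infraction shrinks the score
        scores.append(completion * infraction_penalty)
    return 100.0 * sum(scores) / len(scores)

# Example: a fully completed clean route vs. a partial route with two
# infractions (hypothetical coefficients 0.6 and 0.7).
print(driving_score([(1.0, []), (0.8, [0.6, 0.7])]))  # -> 66.8
```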

This paper provides several key contributions to the field of autonomous driving:

  • It identifies the shortcomings of imitation learning with conventional sensor fusion techniques under complex driving conditions.
  • It presents a transformer-based solution that integrates image and LiDAR data for enhanced end-to-end driving performance.
  • It proposes a new autonomous driving benchmark to facilitate meaningful evaluation of future autonomous systems under dense and complex traffic situations.

The implications of this research are both practical and theoretical. Practically, the attention-based sensor fusion improves situational awareness and decision-making in autonomous systems, potentially reducing accident rates and enhancing overall traffic efficiency. The paper challenges the dominant use of convolutional techniques for sensor fusion and opens doors to more generalized attention-based strategies, which can be applied to the integration of other sensor types. Theoretically, this research provides a foundational understanding of multimodal representation learning with transformers, encouraging further exploration into how self-attention across multiple resolutions and modalities can be leveraged in other applications beyond autonomous driving.

In conclusion, while TransFuser presents a substantial advancement in end-to-end autonomous driving by effectively blending sensory data through transformers, it also opens new avenues for artificial intelligence research by melding multimodal learning with self-attentive architectures. Future work could explore additional sensor types, temporal sequence processing, and real-world deployment challenges such as handling unseen scenarios and mitigating latency. TransFuser underscores the potential of transformer architectures to reshape learning frameworks for autonomous driving.
