BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision (2211.10439v1)

Published 18 Nov 2022 in cs.CV

Abstract: We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-the-art BEV detectors are often tied to certain depth pre-trained backbones like VoVNet, hindering the synergy between booming image backbones and BEV detectors. To address this limitation, we prioritize easing the optimization of BEV detectors by introducing perspective space supervision. To this end, we propose a two-stage BEV detector, where proposals from the perspective head are fed into the bird's-eye-view head for final predictions. To evaluate the effectiveness of our model, we conduct extensive ablation studies focusing on the form of supervision and the generality of the proposed detector. The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset. The code shall be released soon.

Authors (12)

Chenyu Yang (20 papers)
Yuntao Chen (37 papers)
Hao Tian (146 papers)
Chenxin Tao (11 papers)
Xizhou Zhu (73 papers)
Zhaoxiang Zhang (162 papers)
Gao Huang (179 papers)
Hongyang Li (99 papers)
Yu Qiao (563 papers)
Lewei Lu (55 papers)
Jie Zhou (688 papers)
Jifeng Dai (131 papers)

Citations (207)


Summary

The paper introduces perspective supervision to repurpose modern image backbones for enhanced BEV recognition without requiring depth pre-training.
It employs a two-stage detector design that integrates a perspective head with a BEV head to capture 3D environmental cues from 2D images.
Empirical results on nuScenes show significant gains with 63.4% NDS and 55.6% mAP, setting a new benchmark for autonomous driving applications.

Overview of BEVFormer v2: Adaptation of Modern Image Backbones for Bird's-Eye-View Recognition

The paper presents BEVFormer v2, a bird's-eye-view (BEV) detector designed to remove the dependence of existing BEV detectors on backbones pre-trained for depth estimation, such as VoVNet. It adds supervision in the perspective (image) view so that modern image backbones can be used directly and the detector converges faster. On the nuScenes dataset, the method improves on the previous state-of-the-art (SoTA) results.

Contribution and Methodology

The principal contribution of this work is to make modern image backbones usable for BEV recognition through perspective supervision. The design has three components:

Perspective Supervision: A detection head operating in the perspective (image) view provides an additional supervision signal to the image backbone, reducing the reliance on depth-pre-trained backbones. This supervision pushes the backbone to capture 3D cues from 2D images, narrowing the gap between 2D image tasks and 3D scene perception.
Two-Stage BEV Detector Design: Proposals generated by the perspective head are fed into the BEV head for final predictions. They enter the BEV head through hybrid object queries, so the second stage can start from image-derived object locations and is less sensitive to how object positions vary from scene to scene.
Temporal Encoder Redesign: The temporal encoder is redesigned to make better use of long-term temporal information, which matters in dynamic environments such as autonomous driving.

These elements collectively enhance detection performance while reducing convergence time, without necessitating pre-training on depth estimation tasks prevalent in prior methodologies.
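The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Proposal` type, the helper names, the BEV range of ±50 m, the score threshold, and the loss weights are all hypothetical, and the sketch assumes that "hybrid" queries means appending proposal-derived reference points to a fixed learned set, and that the two heads are trained jointly with a weighted sum of their losses.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


@dataclass
class Proposal:
    """Hypothetical first-stage 3D proposal; only center and score are used."""
    x: float      # box center in the ego frame, metres
    y: float
    z: float
    score: float  # confidence from the perspective head


def to_bev_reference(x: float, y: float,
                     bev_range: Tuple[float, float, float, float]) -> Point2D:
    """Drop height and map an (x, y) center to normalized [0, 1] BEV coordinates."""
    x_min, y_min, x_max, y_max = bev_range
    u = (x - x_min) / (x_max - x_min)
    v = (y - y_min) / (y_max - y_min)
    # Clamp so proposals slightly outside the grid still yield valid points.
    return (min(max(u, 0.0), 1.0), min(max(v, 0.0), 1.0))


def hybrid_reference_points(proposals: Sequence[Proposal],
                            learned_refs: Sequence[Point2D],
                            bev_range: Tuple[float, float, float, float],
                            score_thresh: float = 0.3,
                            max_proposals: int = 100) -> List[Point2D]:
    """Combine learned reference points with projected first-stage proposals."""
    # Keep the highest-scoring proposals above the (assumed) threshold.
    kept = sorted((p for p in proposals if p.score >= score_thresh),
                  key=lambda p: p.score, reverse=True)[:max_proposals]
    projected = [to_bev_reference(p.x, p.y, bev_range) for p in kept]
    return list(learned_refs) + projected


def total_loss(loss_bev: float, loss_pers: float,
               w_bev: float = 1.0, w_pers: float = 1.0) -> float:
    """Assumed joint objective: weighted sum of BEV and perspective losses."""
    return w_bev * loss_bev + w_pers * loss_pers


refs = hybrid_reference_points(
    proposals=[Proposal(0.0, 0.0, 1.0, 0.9),
               Proposal(25.0, -25.0, 0.5, 0.6),
               Proposal(10.0, 10.0, 0.0, 0.1)],   # dropped: below threshold
    learned_refs=[(0.1, 0.1)],
    bev_range=(-50.0, -50.0, 50.0, 50.0),
)
print(refs)  # [(0.1, 0.1), (0.5, 0.5), (0.75, 0.25)]
```

Keeping the learned points alongside the projected ones means the second stage still has queries to work with when the perspective head misses an object, while the perspective loss term gives the backbone a direct gradient signal in image space.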

Empirical Validation and Results

On the nuScenes benchmark, BEVFormer v2 achieves 63.4% NDS (nuScenes Detection Score) and 55.6% mAP (mean average precision), surpassing competing methods. The result is attributed to pairing perspective supervision with modern image backbones such as InternImage. The method is also verified with a wide spectrum of traditional and modern image backbones, and ablation studies examine the form of supervision and the generality of the detector.
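For readers unfamiliar with the metric, NDS is defined by the nuScenes detection benchmark as NDS = (1/10) [5 mAP + sum over the five true-positive errors of (1 - min(1, error))], where the errors are mATE, mASE, mAOE, mAVE, and mAAE. The sketch below implements that formula and inverts it to show what the reported NDS and mAP imply about the average true-positive score. The function names are illustrative, and the per-metric errors of BEVFormer v2 are not given in this summary, so none are assumed here.

```python
from typing import Sequence


def nds(m_ap: float, tp_errors: Sequence[float]) -> float:
    """nuScenes Detection Score from mAP and the five mean TP errors.

    tp_errors holds mATE, mASE, mAOE, mAVE, mAAE (order does not matter).
    """
    if len(tp_errors) != 5:
        raise ValueError("expected five TP errors: mATE, mASE, mAOE, mAVE, mAAE")
    # Each error is clipped at 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


def implied_mean_tp_score(nds_value: float, m_ap: float) -> float:
    """Average of the five TP scores implied by a given NDS and mAP."""
    return (10.0 * nds_value - 5.0 * m_ap) / 5.0


# Using only the two figures quoted above (63.4% NDS, 55.6% mAP):
print(round(implied_mean_tp_score(0.634, 0.556), 3))  # 0.712
```

Because mAP carries half the weight, the reported figures imply an average true-positive score of about 0.712 across translation, scale, orientation, velocity, and attribute errors.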

Implications and Future Directions

BEVFormer v2 has direct implications for the design of perception systems in autonomous driving. Because perspective supervision lets modern image backbones be used with BEV detectors without depth pre-training, progress in general-purpose image backbones can carry over to BEV recognition more readily. It also suggests that adding perspective-view supervision to the training of other BEV models is worth investigating.

Further research could explore extending such strategies to leverage larger datasets and more sophisticated image backbones, fostering greater innovation in both the breadth and depth of BEV detection systems.

In conclusion, this work improves the adaptability and performance of BEV recognition by combining perspective supervision, a two-stage detector, and a redesigned temporal encoder, and it reports new SoTA results on nuScenes.
