
PolarFormer: Multi-camera 3D Object Detection with Polar Transformer (2206.15398v6)

Published 30 Jun 2022 in cs.CV and cs.AI

Abstract: 3D object detection in autonomous driving aims to reason about "what" and "where" the objects of interest are in a 3D world. Following the conventional wisdom of earlier 2D object detection, existing methods often adopt the canonical Cartesian coordinate system with perpendicular axes. However, we conjecture that this does not fit the nature of the ego car's perspective, as each onboard camera perceives the world in the shape of a wedge intrinsic to the imaging geometry, with radial (non-perpendicular) axes. Hence, in this paper we advocate the exploitation of the Polar coordinate system and propose a new Polar Transformer (PolarFormer) for more accurate 3D object detection in the bird's-eye-view (BEV), taking as input only multi-camera 2D images. Specifically, we design a cross-attention-based Polar detection head without restriction on the shape of the input structure, to deal with irregular Polar grids. To tackle the unconstrained object scale variations along the Polar distance dimension, we further introduce a multi-scale Polar representation learning strategy. As a result, our model can make the best use of the Polar representation, rasterized by attending to the corresponding image observations in a sequence-to-sequence fashion subject to the geometric constraints. Thorough experiments on the nuScenes dataset demonstrate that our PolarFormer significantly outperforms state-of-the-art 3D object detection alternatives.

Authors (7)
  1. Yanqin Jiang (7 papers)
  2. Li Zhang (693 papers)
  3. Zhenwei Miao (8 papers)
  4. Xiatian Zhu (139 papers)
  5. Jin Gao (38 papers)
  6. Weiming Hu (91 papers)
  7. Yu-Gang Jiang (223 papers)
Citations (150)

Summary

  • The paper introduces a novel PolarFormer framework that leverages the polar coordinate system to overcome limitations of Cartesian detection methods.
  • It employs a cross-attention encoder to transform multi-scale image features into a coherent polar BEV representation for improved 3D object detection.
  • Experimental results on nuScenes demonstrate significant gains in mAP, NDS, and reduced error metrics, highlighting its potential in autonomous driving.

Overview of PolarFormer: Multi-camera 3D Object Detection with Polar Transformer

The paper proposes PolarFormer, a framework for 3D object detection built on the Polar coordinate system. Designed for autonomous driving, PolarFormer aims to overcome the limitations of the Cartesian coordinate system traditionally used in 3D object detection by adopting a Polar representation that aligns more naturally with the perception geometry of the onboard camera systems in autonomous vehicles.

Introduction to PolarFormer

The foundational idea behind PolarFormer is that a Polar coordinate system sidesteps geometric drawbacks of the Cartesian system, such as the misalignment between rectangular grids and camera view frustums, and the reliance on computationally expensive explicit depth estimation to lift image features into BEV. Because each car-mounted camera perceives a wedge-shaped field of view, a Polar grid centered on the ego vehicle matches the intrinsic imaging geometry directly, enabling more accurate bird's-eye-view (BEV) 3D object detection.
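The geometric intuition can be made concrete with a simple coordinate transform. The sketch below (illustrative only, not the paper's code) maps a Cartesian BEV point to the ego-centric polar coordinates that PolarFormer's grid is built on; axis conventions here are an assumption:

```python
import numpy as np

def cartesian_to_polar_bev(x, y):
    """Map a Cartesian BEV point (x: lateral, y: forward, in meters)
    to ego-centric polar coordinates (radius, azimuth).
    Azimuth 0 points straight ahead; positive azimuth is to the right."""
    radius = np.hypot(x, y)
    azimuth = np.arctan2(x, y)
    return radius, azimuth

# A point 10 m ahead and 10 m to the right sits at ~14.14 m, 45 degrees:
r, a = cartesian_to_polar_bev(10.0, 10.0)
print(round(r, 2), round(np.degrees(a), 1))  # 14.14 45.0
```

Note how a column of pixels in the image corresponds to a fixed azimuth ray in this parameterization, which is exactly the alignment the polar grid exploits.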

Methodology

The PolarFormer architecture is systematically constructed around a cross-attention-based Polar detection mechanism that processes input data consisting of only multi-camera 2D images. The core components of this architecture include:

  1. Cross-plane Encoder: This module transforms horizontal image planes into a series of Polar rays via cross-attention. The rays encapsulate multi-scale features extracted from the images, alleviating the irregularities of Polar coordinate grids.
  2. Polar Alignment and BEV Encoding: A Polar alignment step brings the rays from individual cameras into a shared world coordinate system, producing a coherent BEV Polar representation. The BEV encoder then operates at multiple scales to handle the variation in object scale across distances, with feature interaction across these scales.
  3. Polar Detection Head: This head decodes the resulting Polar BEV representation and outputs object detection predictions directly in Polar coordinates, preserving the geometric alignment with the input imaging data.
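The core of step 1 is a cross-attention in which learnable queries, one per radial bin along a polar ray, attend over the features of an image column. A minimal single-head sketch of that operation, with all shapes and names chosen here for illustration rather than taken from the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_plane_attention(ray_queries, column_feats):
    """One cross-attention step from image-column features to a polar ray.
    ray_queries:  (R, C) learnable queries, one per radial bin on the ray.
    column_feats: (H, C) features along one vertical image column.
    Returns (R, C): each radial bin is a weighted mix of column features."""
    scale = ray_queries.shape[-1] ** 0.5
    attn = softmax(ray_queries @ column_feats.T / scale, axis=-1)  # (R, H)
    return attn @ column_feats  # (R, C)

rng = np.random.default_rng(0)
ray = cross_plane_attention(rng.normal(size=(64, 256)),   # 64 radial bins
                            rng.normal(size=(48, 256)))   # 48-pixel column
print(ray.shape)  # (64, 256)
```

Repeating this over every image column yields one ray per azimuth, which the alignment step of component 2 then merges across cameras into the shared BEV polar grid.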

Results and Evaluation

Rigorous experimentation on the nuScenes dataset validates PolarFormer, demonstrating substantial improvements over existing state-of-the-art 3D object detection methods across multiple camera perspectives. In particular, PolarFormer shows significant gains in mean Average Precision (mAP) and nuScenes Detection Score (NDS), indicative of its efficacy and robustness in real-world scenarios. The results also show reduced mean Average Translation Error (mATE) and mean Average Orientation Error (mAOE), underscoring superior geometric accuracy.
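For readers unfamiliar with NDS: the nuScenes devkit defines it as a weighted combination of mAP and five true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE), each clipped to [0, 1] and converted to a score. A small sketch of that formula, using illustrative numbers rather than PolarFormer's reported results:

```python
def nuscenes_detection_score(map_score, tp_errors):
    """NDS as defined by the nuScenes devkit:
    NDS = (5 * mAP + sum of (1 - min(1, err)) over the five
    TP error metrics mATE, mASE, mAOE, mAVE, mAAE) / 10."""
    assert len(tp_errors) == 5, "expects the five TP error metrics"
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * map_score + sum(tp_scores)) / 10.0

# Illustrative inputs only (mAP, then [mATE, mASE, mAOE, mAVE, mAAE]):
print(round(nuscenes_detection_score(0.40, [0.70, 0.27, 0.40, 0.85, 0.20]), 3))
# 0.458
```

This makes clear why lowering mATE and mAOE directly raises NDS alongside mAP.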

Implications and Future Directions

The introduction of Polar coordinates for 3D object detection in PolarFormer suggests a shift in how perception systems for autonomous vehicles are designed. By aligning perception models more closely with the intrinsic imaging properties of vehicular cameras, PolarFormer sets a new benchmark for precision in object detection tasks. Potential future directions include detection heads further tailored to Polar coordinates, more sophisticated multi-scale feature processing, and extensions that exploit temporal data.

Overall, PolarFormer not only provides a viable alternative to Cartesian detection frameworks but also introduces a paradigm that could reshape perception tasks in autonomous systems while remaining computationally efficient and geometrically consistent. Such advances hold promise beyond autonomous driving, potentially benefiting other domains that rely on multi-camera systems.