VectorMapNet: End-to-end Vectorized HD Map Learning (2206.08920v6)
Abstract: Autonomous driving systems require High-Definition (HD) semantic maps to navigate around urban roads. Existing solutions approach the semantic mapping problem through offline manual annotation, which suffers from serious scalability issues. Recent learning-based methods produce dense rasterized segmentation predictions to construct maps. However, these predictions do not include instance information for individual map elements and require heuristic post-processing to obtain vectorized maps. To tackle these challenges, we introduce an end-to-end vectorized HD map learning pipeline, termed VectorMapNet. VectorMapNet takes onboard sensor observations and predicts a sparse set of polylines in the bird's-eye view. This pipeline can explicitly model the spatial relations between map elements and generate vectorized maps that are friendly to downstream autonomous driving tasks. Extensive experiments show that VectorMapNet achieves strong map learning performance on both the nuScenes and Argoverse2 datasets, surpassing previous state-of-the-art methods by 14.2 mAP and 14.6 mAP, respectively. Qualitatively, VectorMapNet is capable of generating comprehensive maps and capturing fine-grained details of road geometry. To the best of our knowledge, VectorMapNet is the first work designed for end-to-end vectorized map learning from onboard observations. Our project website is available at https://tsinghua-mars-lab.github.io/vectormapnet/.