Regressing Transformers for Data-efficient Visual Place Recognition (2401.16304v1)
Abstract: Visual place recognition is a critical task in computer vision, especially for localization and navigation systems. Existing methods often rely on contrastive learning: image descriptors are trained to have a small distance for similar images and a larger distance for dissimilar ones in a latent space. However, this approach struggles to produce descriptor distances that accurately reflect image similarity, particularly when trained with binary pairwise labels, and it typically requires complex re-ranking strategies. This work introduces a fresh perspective by framing place recognition as a regression problem, using camera field-of-view overlap as the similarity ground truth for learning. By optimizing image descriptors to align directly with graded similarity labels, this approach enhances ranking capabilities without expensive re-ranking, offering data-efficient training and strong generalization across several benchmark datasets.
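The regression framing described in the abstract can be illustrated with a minimal sketch: instead of a contrastive loss over binary pairs, the descriptor similarity of an image pair is regressed directly onto a graded overlap label in [0, 1]. This is only an assumed formulation (cosine similarity plus a mean-squared error, with made-up toy data); the paper's actual loss, descriptor backbone, and overlap computation may differ.

```python
import numpy as np

def cosine_similarity(a, b):
    # Row-wise cosine similarity between paired descriptors.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def regression_loss(desc_a, desc_b, overlap):
    """Mean-squared error between descriptor cosine similarity and
    graded field-of-view overlap labels in [0, 1] (assumed loss form)."""
    sim = cosine_similarity(desc_a, desc_b)
    return np.mean((sim - overlap) ** 2)

# Toy example: two image pairs with hypothetical graded overlap labels.
rng = np.random.default_rng(0)
desc_a = rng.standard_normal((2, 128))
desc_b = rng.standard_normal((2, 128))
overlap = np.array([0.9, 0.1])  # hypothetical ground-truth FoV overlap
loss = regression_loss(desc_a, desc_b, overlap)
```

In this formulation, pairs with high field-of-view overlap are pushed toward high descriptor similarity and low-overlap pairs toward low similarity, so the ranking induced by descriptor distance directly tracks the graded labels rather than a binary same-place/different-place split.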