Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition (2402.14505v3)
Abstract: Recent studies show that vision models pre-trained on generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Because the training objectives and data of model pre-training differ inherently from those of VPR, bridging this gap and fully unleashing the capability of pre-trained models for VPR remains a key open issue. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method that achieves both global and local adaptation efficiently, in which only lightweight adapters are tuned and the pre-trained model itself is left unchanged. In addition, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures that proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms state-of-the-art methods with less training data and training time, and uses only about 3% of the retrieval runtime of two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR.
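The abstract's local matching step, which replaces RANSAC-based spatial verification in re-ranking, can be illustrated with a minimal sketch of mutual nearest neighbor matching between two sets of dense local features. This is not the released SelaVPR code: the function names, the use of plain Python lists, dot-product similarity over L2-normalized features, and scoring candidates by their number of mutual matches are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product; equals cosine similarity for L2-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def mutual_nearest_neighbors(
    query_feats: Sequence[Vector], cand_feats: Sequence[Vector]
) -> List[Tuple[int, int]]:
    """Return pairs (i, j) where query feature i and candidate feature j
    are each other's nearest neighbor (hypothetical helper)."""
    if not query_feats or not cand_feats:
        return []
    # Pairwise similarity matrix: rows are query features, columns candidates.
    sim = [[dot(q, c) for c in cand_feats] for q in query_feats]
    # Best candidate index for every query feature (row-wise argmax).
    best_cand = [max(range(len(cand_feats)), key=row.__getitem__) for row in sim]
    # Best query index for every candidate feature (column-wise argmax).
    best_query = [
        max(range(len(query_feats)), key=lambda i: sim[i][j])
        for j in range(len(cand_feats))
    ]
    # Keep a pair only when the nearest-neighbor relation holds both ways.
    return [(i, j) for i, j in enumerate(best_cand) if best_query[j] == i]


def rerank_by_mutual_matches(
    query_feats: Sequence[Vector], candidates: Dict[str, Sequence[Vector]]
) -> List[str]:
    """Order candidate ids by mutual-match count, highest first.
    Using the match count as the re-ranking score is an assumption."""
    scores = {
        cid: len(mutual_nearest_neighbors(query_feats, feats))
        for cid, feats in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    candidates = {
        "B": [[1.0, 0.0], [1.0, 0.0]],  # only one query feature finds a mutual match
        "A": [[0.9, 0.1], [0.1, 0.9]],  # both query features find a mutual match
    }
    print(rerank_by_mutual_matches(query, candidates))
```

Because this check needs only a similarity matrix and two argmax passes, it involves no iterative geometric model fitting, which is consistent with the runtime advantage the abstract reports over RANSAC-based verification. A practical version would compute the similarity matrix with batched tensor operations rather than nested Python loops.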