Learning Cross-view Visual Geo-localization without Ground Truth (2403.12702v1)
Abstract: Cross-View Geo-Localization (CVGL) involves determining the geographical location of a query image by matching it with a corresponding GPS-tagged reference image. Current state-of-the-art methods predominantly rely on training models with labeled paired images, incurring substantial annotation costs and training burdens. In this study, we investigate the adaptation of frozen models for CVGL without requiring ground-truth pair labels. We observe that training on unlabeled cross-view images presents significant challenges, including the need to establish relationships within unlabeled data and to reconcile view discrepancies between uncertain queries and references. To address these challenges, we propose a self-supervised learning framework that trains a learnable adapter for a frozen Foundation Model (FM). This adapter maps feature distributions from diverse views into a uniform space using unlabeled data exclusively. To establish relationships within unlabeled data, we introduce an Expectation-Maximization-based pseudo-labeling module, which iteratively estimates associations between cross-view features and optimizes the adapter. To preserve the robustness of the FM's representation, we incorporate an information consistency module with a reconstruction loss, ensuring that adapted features retain strong discriminative ability across views. Experimental results demonstrate that our proposed method achieves significant improvements over vanilla FMs and competitive accuracy compared to supervised methods, while requiring fewer training parameters and relying solely on unlabeled data. Evaluation of our adaptation on task-specific models further highlights its broad applicability.
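The EM-based pseudo-labeling idea described above can be sketched in miniature: an E-step estimates soft query-to-reference assignments from similarities of adapted features, and an M-step updates the adapter to better fit those assignments. The sketch below is an illustrative toy (a linear adapter `W`, a squared-error M-step objective, and the `e_step`/`m_step` helpers are all our own simplifications, not the paper's actual formulation):

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def e_step(query_feats, ref_feats, temperature=0.1):
    """E-step: soft pseudo-label assignments from cosine similarity."""
    logits = l2_normalize(query_feats) @ l2_normalize(ref_feats).T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def m_step(W, query_raw, ref_feats, assignments, lr=0.1):
    """M-step: one gradient step pulling adapted queries toward the
    expected (pseudo-assigned) reference features."""
    adapted = query_raw @ W
    targets = assignments @ ref_feats       # soft mixture of references
    grad = query_raw.T @ (adapted - targets) / len(query_raw)
    return W - lr * grad

# Toy data: queries are noisy "other-view" copies of the references.
rng = np.random.default_rng(0)
d = 8
refs = rng.normal(size=(5, d))
queries = refs + 0.05 * rng.normal(size=(5, d))
W = np.eye(d)                               # frozen FM output -> adapter
for _ in range(20):                         # alternate E- and M-steps
    A = e_step(queries @ W, refs)
    W = m_step(W, queries, refs, A)
pseudo_labels = np.argmax(e_step(queries @ W, refs), axis=1)
```

In the actual framework the adapter sits on top of frozen FM features, and the reconstruction-based consistency loss would constrain `W` so that adapted features remain recoverable; neither is modeled in this toy loop.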