Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning (2402.02340v2)
Abstract: Deep Metric Learning (DML) has long attracted the attention of the machine learning community as a key objective. Existing solutions concentrate on fine-tuning pre-trained models on conventional image datasets. With the success of recent pre-trained models trained on larger-scale datasets, it is challenging to adapt such a model to DML tasks in a local data domain while retaining the previously acquired knowledge. In this paper, we investigate parameter-efficient methods for fine-tuning the pre-trained model for DML tasks. In particular, we propose a novel and effective framework based on learning Visual Prompts (VPT) in pre-trained Vision Transformers (ViT). Building on the conventional proxy-based DML paradigm, we augment each proxy with semantic information drawn from the input image and the ViT, optimizing the visual prompts for each class. We demonstrate that the resulting semantic proxies have superior representational capability, thereby improving metric learning performance. We conduct extensive experiments on popular DML benchmarks to demonstrate that our proposed framework is both effective and efficient. In particular, our fine-tuning method achieves performance comparable to, or even better than, recent state-of-the-art fully fine-tuned DML methods while tuning only a small percentage of the total parameters.
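The core idea — augmenting a learnable proxy with semantic information extracted from the images of its class, then training with a proxy-based loss — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the blending coefficient `alpha`, the use of per-class mean embeddings as a stand-in for prompt-derived semantic features, and the Proxy-Anchor-style loss form are all assumptions made for the sketch.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project rows onto the unit hypersphere, as is standard in DML."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def semantic_proxies(static_proxies, embeddings, labels, alpha=0.5):
    """Blend each learnable proxy with the mean embedding of its class.

    The class-mean embedding is a stand-in for the semantic information
    that the paper extracts via class-specific visual prompts in the ViT.
    `alpha` (hypothetical) controls the blend.
    """
    proxies = static_proxies.copy()
    for c in np.unique(labels):
        class_mean = embeddings[labels == c].mean(axis=0)
        proxies[c] = alpha * proxies[c] + (1.0 - alpha) * class_mean
    return l2_normalize(proxies)

def proxy_anchor_loss(embeddings, labels, proxies, margin=0.1, scale=32.0):
    """Proxy-Anchor-style loss: pull embeddings toward their class proxy,
    push them away from all other proxies."""
    x = l2_normalize(embeddings)
    sims = x @ proxies.T                        # (batch, num_classes)
    num_classes = proxies.shape[0]
    pos_mask = np.eye(num_classes)[labels].astype(bool)

    loss = 0.0
    num_pos_classes = len(np.unique(labels))
    for c in range(num_classes):
        pos = sims[pos_mask[:, c], c]           # samples of class c
        neg = sims[~pos_mask[:, c], c]          # all other samples
        if pos.size:
            loss += np.log1p(np.exp(-scale * (pos - margin)).sum()) / num_pos_classes
        if neg.size:
            loss += np.log1p(np.exp(scale * (neg + margin)).sum()) / num_classes
    return loss
```

In the full framework, the semantic features would come from the ViT's output under per-class visual prompts rather than raw class means; only the prompts and proxies would receive gradients, which is what makes the method parameter-efficient.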