Not All Pairs are Equal: Hierarchical Learning for Average-Precision-Oriented Video Retrieval (2407.15566v1)
Abstract: The rapid growth of online video resources has significantly promoted the development of video retrieval methods. As a standard evaluation metric for video retrieval, Average Precision (AP) assesses the overall ranking of relevant videos at the top of the list, making the predicted scores a reliable reference for users. However, recent video retrieval methods use pair-wise losses that treat all sample pairs equally, leaving an evident gap between the training objective and the evaluation metric. To bridge this gap, we address two primary challenges: a) the current similarity measure and AP-based loss are suboptimal for video retrieval; b) the noticeable noise from frame-to-frame matching introduces ambiguity in estimating the AP loss. In response, we propose the Hierarchical learning framework for Average-Precision-oriented Video Retrieval (HAP-VR). For the former challenge, we develop the TopK-Chamfer Similarity and the QuadLinear-AP loss to measure and optimize video-level similarities in terms of AP. For the latter challenge, we constrain the frame-level similarities to achieve an accurate estimate of the AP loss. Experimental results show that HAP-VR outperforms existing methods on several benchmark datasets, providing a feasible solution for video retrieval tasks and offering potential benefits for multimedia applications.
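To make the gap between pair-wise objectives and AP concrete, the following minimal sketch computes the standard (non-differentiable) AP of a ranked list and compares two rankings that have the same number of mis-ordered positive/negative pairs. The function names, scores, and labels are illustrative assumptions and do not come from the paper; this is not the QuadLinear-AP loss or the TopK-Chamfer Similarity, only the textbook metric those components are said to target.

```python
def average_precision(scores, labels):
    """Standard AP: mean of precision@k taken at the rank k of each relevant item."""
    # Rank candidates by descending predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def inverted_pairs(scores, labels):
    """Count (relevant, irrelevant) pairs in which the irrelevant item scores higher."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    return sum(1 for p in positives for n in negatives if n > p)


# Hypothetical toy data: 1 = relevant video, 0 = irrelevant video.
labels = [1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.3, 0.7, 0.1]  # ranked order: R N N R N
scores_b = [0.7, 0.9, 0.8, 0.3, 0.1]  # ranked order: N R R N N

for name, scores in (("A", scores_a), ("B", scores_b)):
    print(name, inverted_pairs(scores, labels), round(average_precision(scores, labels), 4))
```

Both rankings contain two inverted pairs, so an objective that simply counts pair violations cannot tell them apart, yet ranking A scores a higher AP because its errors sit lower in the list. This is the sense in which not all pairs contribute equally to AP.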