Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization (2308.12609v2)
Abstract: Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using only video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline and generally process each segment individually, so they exploit only limited contextual information. As a result, the model lacks a comprehensive understanding (e.g., of appearance and temporal structure) of the various action patterns, which leads to ambiguity in both classification learning and temporal localization. Our work addresses this from a novel perspective: it explores and exploits the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances from weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating these ambiguities. Specifically, we propose an end-to-end framework comprising a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, which helps the model learn a more structured and compact embedding space and thus reduces ambiguity in classification learning. Second, the GKSA module efficiently summarizes and propagates representative cross-video action knowledge in a learnable manner to promote a holistic understanding of action patterns, which in turn allows the generation of high-confidence pseudo-labels for self-learning and thus alleviates ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms state-of-the-art methods and can be easily plugged into other WSTAL methods.
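The abstract does not give the RMGCL formulation, so the following is only a generic sketch of contrastive learning against a class-wise memory bank, not the paper's method. The names `ClassMemory` and `memory_contrastive_loss`, the momentum update, and the temperature value are all assumptions made for illustration. It shows the general idea of pulling a feature toward a prototype accumulated across videos of the same class while pushing it away from prototypes of other classes.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ClassMemory:
    """Hypothetical per-class prototype memory, updated by momentum."""

    def __init__(self, num_classes, dim, momentum=0.9):
        self.momentum = momentum
        self.bank = [[0.0] * dim for _ in range(num_classes)]
        self.filled = [False] * num_classes

    def update(self, label, feature):
        feature = l2_normalize(feature)
        if not self.filled[label]:
            # The first feature seen for a class initializes its prototype.
            self.bank[label] = feature
            self.filled[label] = True
            return
        m = self.momentum
        mixed = [m * old + (1 - m) * new
                 for old, new in zip(self.bank[label], feature)]
        self.bank[label] = l2_normalize(mixed)


def memory_contrastive_loss(feature, label, memory, temperature=0.1):
    """InfoNCE-style loss of one feature against all class prototypes."""
    feature = l2_normalize(feature)
    logits = [dot(feature, proto) / temperature for proto in memory.bank]
    # Numerically stable log-sum-exp over all prototypes.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[label]
```

In use, the memory would be updated with features from confidently classified segments across many videos, and the loss would be added to the classification objective. A feature aligned with its own class prototype yields a lower loss than the same feature scored against a different class.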