Query-based Video Summarization with Pseudo Label Supervision (2307.01945v1)

Published 4 Jul 2023 in cs.CV, cs.AI, and cs.IR

Abstract: Existing datasets for manually labelled query-based video summarization are costly and thus small, limiting the performance of supervised deep video summarization models. Self-supervision can address the data sparsity challenge by using a pretext task and defining a method to acquire extra data with pseudo labels to pre-train a supervised deep model. In this work, we introduce segment-level pseudo labels from input videos to properly model both the relationship between a pretext task and a target task, and the implicit relationship between the pseudo label and the human-defined label. The pseudo labels are generated based on existing human-defined frame-level labels. To create more accurate query-dependent video summaries, a semantics booster is proposed to generate context-aware query representations. Furthermore, we propose mutual attention to help capture the interactive information between visual and textual modalities. Three commonly-used video summarization benchmarks are used to thoroughly validate the proposed approach. Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
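
The abstract bundles three technical ideas: segment-level pseudo labels derived from human-defined frame-level labels, a semantics booster for context-aware query representations, and mutual attention between the visual and textual streams. The sketch below is a minimal, hedged illustration of the first and third ideas in PyTorch; the helper segment_pseudo_labels, the mean-pooling rule, the MutualAttention module, and all dimensions are assumptions for illustration, not the authors' reported implementation.

```python
import torch
import torch.nn as nn


def segment_pseudo_labels(frame_scores, segment_bounds):
    """Derive segment-level pseudo labels by pooling human-defined frame-level
    importance scores within each segment. Mean pooling is an illustrative
    assumption, not the paper's exact aggregation rule.

    frame_scores:   (num_frames,) tensor of frame-level labels
    segment_bounds: list of (start, end) frame indices, end exclusive
    """
    return torch.stack([frame_scores[s:e].mean() for s, e in segment_bounds])


class MutualAttention(nn.Module):
    """Bidirectional cross-modal attention: video features attend to query-token
    features and vice versa. Embedding size and head count are placeholders."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.vis_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual, textual):
        # visual:  (batch, num_frames, dim) frame-level video features
        # textual: (batch, num_tokens, dim) context-aware query representations
        vis_out, _ = self.vis_to_txt(query=visual, key=textual, value=textual)
        txt_out, _ = self.txt_to_vis(query=textual, key=visual, value=visual)
        return vis_out, txt_out


if __name__ == "__main__":
    # Toy example: a 120-frame video split into four 30-frame segments,
    # paired with an 8-token textual query, both projected to 512-d features.
    scores = torch.rand(120)
    print(segment_pseudo_labels(scores, [(0, 30), (30, 60), (60, 90), (90, 120)]))

    frames, query = torch.randn(1, 120, 512), torch.randn(1, 8, 512)
    v, t = MutualAttention()(frames, query)
    print(v.shape, t.shape)  # torch.Size([1, 120, 512]) torch.Size([1, 8, 512])
```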
