Query-based Video Summarization with Pseudo Label Supervision (2307.01945v1)
Abstract: Manually labelled datasets for query-based video summarization are costly to create and thus small, limiting the performance of supervised deep video summarization models. Self-supervision can address this data sparsity by using a pretext task and defining a method to acquire extra data with pseudo labels to pre-train a supervised deep model. In this work, we introduce segment-level pseudo labels from input videos to properly model both the relationship between a pretext task and a target task, and the implicit relationship between the pseudo label and the human-defined label. The pseudo labels are generated based on existing human-defined frame-level labels. To create more accurate query-dependent video summaries, a semantics booster is proposed to generate context-aware query representations. Furthermore, we propose mutual attention to help capture the interactive information between visual and textual modalities. The proposed approach is validated on three commonly used video summarization benchmarks. Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
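The abstract states that segment-level pseudo labels are generated from existing human-defined frame-level labels, but it does not give the aggregation rule. The following is only a minimal sketch of one plausible scheme, not the paper's procedure: the function name `segment_pseudo_labels`, the fixed-length non-overlapping segments, the mean aggregation, and the `threshold` parameter are all assumptions introduced here for illustration.

```python
from typing import List


def segment_pseudo_labels(frame_scores: List[float], segment_len: int,
                          threshold: float = 0.5) -> List[int]:
    """Aggregate frame-level scores into one binary label per segment.

    Assumption (not from the paper): a segment is a fixed-length,
    non-overlapping window, and it is labelled 1 when the mean of its
    frame scores reaches `threshold`.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    labels = []
    for start in range(0, len(frame_scores), segment_len):
        # The last window may be shorter than segment_len.
        window = frame_scores[start:start + segment_len]
        mean_score = sum(window) / len(window)
        labels.append(1 if mean_score >= threshold else 0)
    return labels


# Hypothetical frame-level relevance scores for a 7-frame video.
frame_scores = [0.9, 0.8, 0.1, 0.2, 0.7, 0.6, 0.0]
print(segment_pseudo_labels(frame_scores, segment_len=2))  # prints [1, 0, 1, 0]
```

Under these assumptions, each segment inherits a coarse label from the frames it contains, which is one way a pre-training signal could be derived without additional manual annotation.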