
SOVC: Subject-Oriented Video Captioning (2312.13330v2)

Published 20 Dec 2023 in cs.CV

Abstract: Describing video content according to users' needs is a long-held goal. Although existing video captioning methods have made significant progress, the generated captions may not focus on the entity the user is particularly interested in. To address this problem, we propose a new video captioning task, Subject-Oriented Video Captioning (SOVC), which allows users to specify the target to describe via a bounding box. To support this task, we construct two subject-oriented video captioning datasets based on the widely used MSVD and MSR-VTT datasets by annotating the subject of each caption in each video. These datasets pave the way for describing the targets users are interested in. To tackle this task, we introduce SOVCNet, a method tailored to the task. It consists of two key components: a subject-oriented sampling module that samples frames related to the subject to minimize irrelevant information, and a subject-oriented encoding module that uses the subject areas as hard prompts and integrates learnable soft prompts, enhancing the model's focus on the subject's activities and facilitating adaptation to the downstream generation task. Extensive experimental results demonstrate the effectiveness of our method on this new task.
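The two components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the use of IoU with a user-specified box as the frame-relevance score, and the token shapes are all assumptions made for illustration only.

```python
# Illustrative sketch (not SOVCNet's actual code) of the two ideas:
# 1) subject-oriented sampling: keep the frames most relevant to the
#    user-specified subject bounding box,
# 2) subject-oriented encoding: prepend learnable soft prompts and the
#    subject-region tokens (hard prompt) to the sampled frame tokens.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def subject_oriented_sample(frame_boxes, subject_box, k):
    """Return indices (in temporal order) of the k frames whose detected
    box best overlaps the subject box -- a stand-in relevance score."""
    scores = [(iou(box, subject_box), i) for i, box in enumerate(frame_boxes)]
    scores.sort(reverse=True)
    return sorted(i for _, i in scores[:k])

def build_prompted_input(soft_prompts, subject_tokens, frame_tokens):
    """Concatenate soft prompts, the subject-region hard prompt, and the
    sampled frame tokens into one input sequence for the caption decoder."""
    return soft_prompts + subject_tokens + frame_tokens
```

For example, given per-frame subject detections, `subject_oriented_sample` keeps only the frames where the tracked box overlaps the user's box, and the resulting frame tokens are then wrapped with the prompts before decoding.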

Authors (6)
  1. Yunchuan Ma (2 papers)
  2. Chang Teng (1 paper)
  3. Yuankai Qi (46 papers)
  4. Guorong Li (36 papers)
  5. Laiyun Qing (1 paper)
  6. Qingming Huang (168 papers)
