SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks (2401.17773v1)
Abstract: We present a framework for learning cross-modal video representations by pre-training directly on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and the proxy tasks. First, motivated by the shortcomings of the two mainstream pixel-level pre-training architectures (limited applicability or low efficiency), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people attend to a few "significant words" when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which introduces a novel masking-and-matching proxy task to improve pre-training performance. Experiments conducted on three downstream video-text tasks and six datasets demonstrate that we establish a new state of the art in pixel-level video-text pre-training and achieve a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase is available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/snps3_vtp.
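To make the two ideas concrete, the sketch below is a hypothetical PyTorch illustration, not the authors' released code. It shows (a) an SNP-style setup in which a single shared Transformer encoder refines both text-only features and concatenated video-text features, and (b) an S3-style masking helper that masks the most "significant" tokens given externally supplied importance scores. All names, dimensions, and the importance-scoring mechanism are assumptions; the paper's actual architecture and selection rule may differ.

```python
import torch
import torch.nn as nn


class SharedNetworkPretraining(nn.Module):
    """SNP-style sketch: one shared encoder serves both the text-only
    and the cross-modal (video + text) forward passes."""

    def __init__(self, dim=256, n_heads=4, n_layers=2,
                 vocab_size=30522, visual_dim=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        # Video frames are assumed to arrive as pre-extracted patch features.
        self.video_proj = nn.Linear(visual_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        # The single "BERT-type" network shared by both passes.
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.match_head = nn.Linear(dim, 1)  # video-text matching score

    def forward(self, token_ids, video_feats):
        t = self.text_embed(token_ids)    # (B, Lt, D) textual tokens
        v = self.video_proj(video_feats)  # (B, Lv, D) visual tokens
        text_out = self.shared_encoder(t)                          # text-only refinement
        fused_out = self.shared_encoder(torch.cat([v, t], dim=1))  # cross-modal refinement
        match_logit = self.match_head(fused_out.mean(dim=1))       # matching proxy task
        return text_out, fused_out, match_logit


def mask_significant_tokens(token_ids, importance, mask_id=103, ratio=0.15):
    """S3-style masking sketch: mask the highest-importance tokens
    ("significant words") instead of masking uniformly at random.
    `importance` is assumed to come from an external source such as
    part-of-speech tags or attention weights."""
    k = max(1, int(token_ids.size(1) * ratio))
    top = importance.topk(k, dim=1).indices  # positions of the k most significant tokens
    masked = token_ids.clone()
    masked.scatter_(1, top, mask_id)         # 103 is BERT's [MASK] id
    return masked


# Toy usage: 2 clips, 8 visual tokens and 6 text tokens each.
tokens = torch.randint(0, 30522, (2, 6))
video = torch.randn(2, 8, 512)
masked = mask_significant_tokens(tokens, torch.rand(2, 6))
_, _, match = SharedNetworkPretraining()(masked, video)
print(match.shape)  # torch.Size([2, 1])
```

Sharing one encoder, rather than stacking a separate text encoder on top of a fusion encoder, is what would keep the parameter count low while still exposing both unimodal and fused features to the proxy tasks.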