SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks (2401.17773v1)
Abstract: We present a framework for learning cross-modal video representations by pre-training directly on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and the proxy tasks. First, motivated by the shortcomings of the two mainstream pixel-level pre-training architectures (limited applicability or low efficiency), we propose Shared Network Pre-training (SNP). By employing a single shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people focus on a few "significant words" when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which includes novel masking and matching proxy tasks to promote pre-training performance. Experiments on three downstream video-text tasks and six datasets demonstrate that we establish a new state of the art in pixel-level video-text pre-training, and that we achieve a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase is available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/snps3_vtp.
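To make the two ideas concrete, below is a minimal PyTorch sketch of what a shared-network encoder and significance-weighted masking might look like. Everything here (`SharedEncoder`, `s3_mask`, the dimensions, the significance flags, the matching head) is an illustrative assumption rather than the actual SNP-S3 implementation; see the linked codebase for the real model.

```python
# Minimal, hypothetical sketch of the SNP / S3 ideas -- NOT the official code.
import torch
import torch.nn as nn

MASK_ID = 103  # [MASK] token id in the standard BERT vocabulary


class SharedEncoder(nn.Module):
    """One BERT-type transformer reused for both the text-only pass and the
    cross-modal (video + text) pass, sketching the SNP weight-sharing idea."""

    def __init__(self, dim=768, heads=12, layers=6, vocab=30522, video_dim=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.video_proj = nn.Linear(video_dim, dim)  # project raw video features
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)  # the shared network
        self.match_head = nn.Linear(dim, 2)  # video-text matching logits

    def encode_text(self, token_ids):
        # Text-only pass: refine textual features with the shared encoder.
        return self.encoder(self.tok_emb(token_ids))

    def encode_cross(self, token_ids, video_feats):
        # Cross-modal pass: concatenate video and text tokens and run the
        # SAME encoder, so no separate fusion network is needed.
        fused = torch.cat(
            [self.video_proj(video_feats), self.tok_emb(token_ids)], dim=1)
        out = self.encoder(fused)
        return out, self.match_head(out[:, 0])  # features + matching logits


def s3_mask(token_ids, significant, p_sig=0.5, p_other=0.15):
    """Mask 'significant words' (flagged upstream, e.g. nouns/verbs) with a
    higher probability than ordinary tokens -- one plausible reading of the
    S3 masking proxy task."""
    probs = torch.full(token_ids.shape, p_other)
    probs[significant.bool()] = p_sig
    mask = torch.bernoulli(probs).bool()
    labels = token_ids.masked_fill(~mask, -100)  # -100 is ignored by CE loss
    return token_ids.masked_fill(mask, MASK_ID), labels


if __name__ == "__main__":
    enc = SharedEncoder()
    ids = torch.randint(1000, 2000, (2, 16))  # fake token ids
    sig = torch.zeros(2, 16)
    sig[:, :3] = 1                            # first 3 tokens marked "significant"
    vid = torch.randn(2, 8, 1024)             # 8 video tokens per clip
    masked_ids, mlm_labels = s3_mask(ids, sig)
    text_feats = enc.encode_text(masked_ids)                     # (2, 16, 768)
    cross_feats, match_logits = enc.encode_cross(masked_ids, vid)
```

In the paper, the matching side of S3 is also built around significant words; the single `match_head` above is only a placeholder for that idea, and the key design point the sketch illustrates is that both `encode_text` and `encode_cross` reuse one set of encoder weights.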