Learning Spatiotemporal Features via Video and Text Pair Discrimination
This paper presents an approach to learning spatiotemporal features from videos without requiring extensive manual annotation. It leverages the textual information that naturally accompanies videos, such as titles and captions on platforms like YouTube and Instagram, as a source of weak supervision. The work introduces the Cross-Modal Pair Discrimination (CPD) framework, designed to exploit the correlation between videos and their associated text.
Framework and Methodology
The CPD framework is grounded in the principle of linking video content with its corresponding text to drive feature learning: each video-text pair is treated as a distinct class, and the model learns to discriminate matched pairs from mismatched ones. Because the number of such pair classes grows with the dataset, the approach relies on noise-contrastive estimation to approximate the otherwise intractable softmax over all pairs. To cope with the noisy, uncurated nature of the video-text data, the framework additionally employs a curriculum learning strategy that stabilizes feature learning, described in more detail below.
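To make the pair-discrimination objective concrete, the following is a minimal sketch of an in-batch, InfoNCE-style contrastive loss over matched video-text embeddings. It captures the spirit of the paper's noise-contrastive formulation rather than reproducing it exactly; the function name, the use of other in-batch pairs as negatives, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_nce_loss(video_emb, text_emb, temperature=0.07):
    """In-batch contrastive loss over video-text pairs.

    video_emb, text_emb: (B, D) embeddings where row i of each tensor
    comes from the same video-text pair. Matched pairs act as positives;
    all other pairings in the batch serve as noise/negative samples.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Symmetric objective: video-to-text and text-to-video discrimination.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```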
The CPD model uses a separate embedding function for each modality, visual and textual, mapping both inputs into a shared feature space. Training proceeds in stages: the pre-trained text model is initially kept frozen so that gradients from the still-unstable visual model do not degrade its language representations. Once the visual model has stabilized, both encoders are trained jointly, further refining the model's ability to encode meaningful video representations.
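As a rough illustration of this two-tower design and the staged training schedule, here is a PyTorch-style sketch. The encoder classes, their out_dim attributes, the embedding dimension, and the set_stage helper are hypothetical placeholders under the assumptions above, not the paper's exact architecture.

```python
import torch.nn as nn

class CPDStyleModel(nn.Module):
    """Two-tower sketch: separate visual and textual encoders project
    their inputs into a shared embedding space."""

    def __init__(self, video_backbone, text_backbone, embed_dim=256):
        super().__init__()
        self.video_backbone = video_backbone  # e.g. a 3D CNN returning (B, Dv); out_dim is assumed
        self.text_backbone = text_backbone    # e.g. a pretrained language model returning (B, Dt)
        self.video_proj = nn.Linear(video_backbone.out_dim, embed_dim)
        self.text_proj = nn.Linear(text_backbone.out_dim, embed_dim)

    def forward(self, clips, tokens):
        v = self.video_proj(self.video_backbone(clips))
        t = self.text_proj(self.text_backbone(tokens))
        return v, t

    def set_stage(self, joint: bool):
        # Stage 1 (joint=False): freeze the pretrained text encoder so noisy
        # gradients from the unstable visual tower do not degrade it.
        # Stage 2 (joint=True): unfreeze everything for joint training.
        for p in self.text_backbone.parameters():
            p.requires_grad = joint
```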
Experimental Setup
The paper trains on both a curated dataset (Kinetics-210k) and an uncurated web dataset (Instagram-300k). Evaluation is conducted on the action recognition benchmarks UCF101 and HMDB51, where CPD-trained models show significant improvements. Notably, the models achieve competitive results even without additional fine-tuning, and they provide a strong initialization for downstream tasks, yielding gains over state-of-the-art self-supervised techniques. Especially noteworthy is that CPD reaches performance comparable to methods trained on far larger datasets, highlighting its efficiency in both data and computation.
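For the downstream use described above, a hedged sketch of reusing a CPD-pretrained visual backbone as initialization for an action-recognition classifier might look as follows; the checkpoint key, file path, and build_finetune_model helper are hypothetical, not part of the paper's released tooling.

```python
import torch
import torch.nn as nn

def build_finetune_model(video_backbone, num_classes, checkpoint_path=None):
    """Attach a linear classification head to a (possibly pretrained) visual backbone."""
    if checkpoint_path is not None:
        # Assumed checkpoint layout: weights stored under a "video_backbone" key.
        state = torch.load(checkpoint_path, map_location="cpu")
        video_backbone.load_state_dict(state["video_backbone"], strict=False)
    return nn.Sequential(video_backbone, nn.Linear(video_backbone.out_dim, num_classes))

# Example usage: UCF101 has 101 action classes, HMDB51 has 51.
# model = build_finetune_model(my_backbone, num_classes=101, checkpoint_path="cpd_pretrain.pt")
```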
Implications and Future Work
The implications of this research extend to practical applications in scenarios where computational resources are limited or annotated datasets are scarce. The CPD approach reduces dependency on extensive manual data preparation, suggesting a method for efficient model training using existing multimedia data enriched by associated text.
Future research directions could explore integrating more sophisticated text processing techniques, enhancing textual noise filtering, or extending the CPD model across varied multimedia data types beyond video. Additionally, understanding the intricacies of cross-modal learning further might pave the way for more generalized applications, influencing developments in AI that seek to leverage organically available data.
This paper takes a significant step toward more accessible and efficient learning of spatiotemporal features, demonstrating the potential of cross-modal supervision for action recognition tasks.