End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Published 13 Dec 2019 in cs.CV | (1912.06430v4)

Abstract: Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.

Authors (6)

Citations (670)

Summary

The paper introduces the MIL-NCE loss, which handles misalignment between video clips and their narrations without manual annotation.
It proposes a joint video-text embedding method that learns representations directly from raw HowTo100M videos.
Evaluations across action recognition, text-to-video retrieval, action localization, and action segmentation show the method outperforming published self-supervised approaches and several fully supervised baselines.

Overview of "End-to-End Learning of Visual Representations from Uncurated Instructional Videos"

This paper investigates how to learn visual representations from uncurated, narrated instructional videos without using any manually annotated data. It presents a method that combines Multiple Instance Learning and Noise Contrastive Estimation (MIL-NCE) to handle the misalignment between what is shown in a video and what is said in its narration. With this approach, video representations can be trained from scratch on the large, unannotated HowTo100M dataset of narrated videos.

Main Contributions

MIL-NCE Loss: The paper introduces a novel MIL-NCE loss that deals with the misalignment of visual and textual data commonly observed in narrated videos. This loss combines concepts from Multiple Instance Learning (MIL) and Noise Contrastive Estimation (NCE) to effectively train the models despite the noisy and weak supervision inherent in uncurated instructional video datasets.
Joint Video-Text Embedding: The proposed method learns a joint embedding space for video and text in which semantically similar clips and narrations lie close together. The embeddings are learned end-to-end from the raw video and narration inputs, without relying on manual annotation.
Evaluation Across Diverse Tasks: The paper assesses the learned representations on four downstream tasks (action recognition, text-to-video retrieval, action localization, and action segmentation) spread over eight datasets. The method is reported to outperform all published self-supervised approaches on these tasks, as well as several fully supervised baselines.
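
As a rough guide to how the MIL-NCE loss combines the two ideas, the objective can be written in the following form. The notation is an assumption introduced here for illustration: f and g are the video and text encoders, P_i is the set of candidate positive clip-narration pairs for the i-th training sample, and N_i is its set of negative pairs.

```latex
\max_{f,g} \; \sum_{i=1}^{n} \log \left(
  \frac{\sum_{(x,y) \in \mathcal{P}_i} e^{f(x)^\top g(y)}}
       {\sum_{(x,y) \in \mathcal{P}_i} e^{f(x)^\top g(y)}
        + \sum_{(x',y') \in \mathcal{N}_i} e^{f(x')^\top g(y')}}
\right)
```

The sum over P_i in the numerator is the Multiple Instance Learning part: the model is rewarded if any of the candidate pairs scores highly, so not every candidate has to be correctly aligned. The contrast against N_i in the denominator is the NCE part.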

Evaluation and Results

The evaluation spans four tasks on established benchmarks:
1. Action Recognition: Evaluated on the HMDB-51, UCF-101, and Kinetics-700 datasets. The learned representations are reported to outperform prior self-supervised methods.
2. Text-to-Video Retrieval: Evaluated on the YouCook2 and MSR-VTT datasets, demonstrating strong retrieval capabilities without any additional training on these datasets.
3. Action Localization: Evaluated on YouTube-8M Segments and CrossTask, where actions must be located in time within a video.
4. Action Segmentation: Evaluated on COIN.
Across these tasks, the method is reported to outperform all published self-supervised approaches as well as several fully supervised baselines.
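
To make the retrieval setting in item 2 concrete: with a joint embedding, each text query can be scored against every video by a dot product, and the videos ranked by that score. The sketch below computes recall at K from such a similarity matrix. The metric choice, the function name `recall_at_k`, and the convention that query i matches video i are assumptions made for illustration; the summary above does not state which retrieval metric is used.

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of text queries whose correct video ranks in the top k.

    sim[i, j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    correct = np.diag(sim)[:, None]          # score of the matching video
    # Rank = number of videos scoring strictly higher than the correct one.
    ranks = np.sum(sim > correct, axis=1)    # 0 means ranked first
    return float(np.mean(ranks < k))

# Toy similarity matrix: 3 text queries against 3 videos.
sim = np.array([
    [0.9, 0.1, 0.0],   # query 0: correct video ranked first
    [0.8, 0.2, 0.1],   # query 1: correct video ranked second
    [0.1, 0.2, 0.3],   # query 2: correct video ranked first
])
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

In practice `sim` would come from the learned encoders, for example as the matrix product of text embeddings and transposed video embeddings.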

Methodology

Rather than trusting a single clip-narration pair, MIL-NCE treats a set of candidate pairs as positives for each clip, for example the narrations that occur close to it in time. Considering multiple candidates raises the chance that at least one of them is correctly aligned with the visual content, which makes training more tolerant of noisy narration. Negatives are drawn symmetrically: each clip is contrasted with narrations from other pairs, and each narration with clips from other pairs.
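
The idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mil_nce_loss`, the array shapes, the use of other samples in the batch as negatives, and the plain dot-product score are all choices made here for clarity.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

def mil_nce_loss(video_emb, text_emb):
    """Illustrative MIL-NCE-style loss (to be minimized).

    video_emb: float array (B, D), one embedding per clip.
    text_emb:  float array (B, K, D), K candidate narrations per clip.
    Assumption: candidates of clip i are its positives; everything
    paired with a different batch index is a negative.
    """
    B, K, _ = text_emb.shape
    # scores[i, j, k] = <clip i, k-th candidate narration of clip j>
    scores = np.einsum("id,jkd->ijk", video_emb, text_emb)
    idx = np.arange(B)

    # Numerator: sum over all K candidate positives of each clip.
    log_num = logsumexp(scores[idx, idx, :], axis=1)

    # Denominator: positives, plus clip i vs. other narrations,
    # plus narrations of clip i vs. other clips (symmetric negatives).
    other_clips = scores.transpose(1, 0, 2).copy()  # [i, j, k] = scores[j, i, k]
    other_clips[idx, idx, :] = -np.inf              # count positives only once
    denom = np.concatenate(
        [scores.reshape(B, -1), other_clips.reshape(B, -1)], axis=1
    )
    log_den = logsumexp(denom, axis=1)

    return float(np.mean(log_den - log_num))

# Toy check: aligned embeddings should give a lower loss than shuffled ones.
v = 5.0 * np.eye(4)                          # 4 clips, D = 4
t = np.repeat(v[:, None, :], 3, axis=1)      # 3 identical candidates per clip
loss_aligned = mil_nce_loss(v, t)
loss_shuffled = mil_nce_loss(v, t[[1, 2, 3, 0]])
```

Because the numerator sums over all candidates, a single well-matched narration is enough to drive the loss down for a clip, even if the other candidates are unrelated to it.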

Implications and Future Developments

The main implication of this research concerns the scalability of training video models. By removing the need for manual annotation, the method makes it practical to train on large uncurated datasets. Possible future directions include refining the MIL-NCE objective, improving the robustness of the joint embedding, and applying the approach to other uncurated data sources and other multimedia modalities.

In summary, the research presents a framework for end-to-end learning of visual representations that addresses misalignment and noise in instructional video narrations, with reported gains across action recognition, retrieval, localization, and segmentation.
