Rescaling Egocentric Vision (2006.13256v4)

Published 23 Jun 2020 in cs.CV and cs.LG

Abstract: This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (+128% more action segments). This collection enables new challenges such as action detection and evaluating the "test of time" - i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), as well as unsupervised domain adaptation for action recognition. For each challenge, we define the task, provide baselines and evaluation metrics.

Summary

The paper introduces a 'pause-and-talk' annotation method that enhances data quality while reducing cognitive load during video narration.
The paper argues that the new pipeline improves label accuracy without sacrificing scalability, and it introduces a split for unsupervised domain adaptation.
The paper analyses how individual modalities (RGB, Flow, Audio) cope with the visual domain shift between the two recording periods, supporting action recognition research.

Overview of Enhancements in the EPIC-KITCHENS-100 Dataset

The reviewed paper describes how the EPIC-KITCHENS dataset was extended from its initial EPIC-KITCHENS-55 version to EPIC-KITCHENS-100. The primary contributions are methodological improvements in data annotation and a set of domain adaptation challenges, both relevant to computer vision on egocentric video. The annotation improvements rest on a "pause-and-talk" narration approach, which aims to improve both the density and the accuracy of narrations while maintaining scalability.
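To give a feel for the scale behind the density claim, the headline figures from the abstract (100 hours, 20M frames, 90K actions, 700 videos) can be turned into rough average rates. This is a minimal sketch: the function name `average_rates` is hypothetical, and the inputs are the rounded totals from the abstract, so the results are dataset-wide approximations rather than the statistics reported in the paper.

```python
def average_rates(hours: float, frames: float, actions: float, videos: int) -> dict:
    """Derive rough dataset-wide averages from rounded headline totals.

    All inputs are assumed to be the rounded figures quoted in the abstract,
    so the outputs are approximations only.
    """
    minutes = hours * 60
    seconds = minutes * 60
    return {
        "actions_per_minute": actions / minutes,
        "frames_per_second": frames / seconds,
        "minutes_per_video": minutes / videos,
    }


# Rounded totals quoted in the abstract.
rates = average_rates(hours=100, frames=20e6, actions=90e3, videos=700)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

With these rounded inputs the averages come out at roughly 15 actions per minute, about 56 frames per second, and about 8.6 minutes per video. Per-video figures will vary, since the videos are of variable length.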

Key Points of Contribution

Annotation Methodology: The transition from a "non-stop" to a "pause-and-talk" narration represents a pivotal development in data collection strategies. This methodological shift mitigates the cognitive load on participants by allowing annotations to be recorded while the video is paused, which potentially increases the quality of annotations by reducing errors associated with simultaneous task performance.
Scalability and Quality Balance: Despite initial skepticism that accuracy gains would come at the cost of scalability, the authors argue that their approach remains scalable while improving label quality. They support this by revising terminology and descriptions throughout the paper to address potential misunderstandings.
Domain Adaptation: The dataset introduces a novel split for unsupervised domain adaptation tasks. This split challenges models with domain shifts caused by the passage of time, differing recording equipment, and variations in the physical setting. These factors test the robustness of action recognition models against environmental and temporal variability.
Visual Domain Characteristics: The paper examines the domain gap in detail, discussing how differences in visual input between the two collection periods influence model performance. It reports how well each modality (RGB, Flow, Audio) adapts to this gap and backs the analysis with quantitative metrics.
Practical Implications: The paper underscores how the improved annotation method helps scale up training data, and it points to potential remedies for models overfitting to specific domains. This matters when adapting models to environments with naturally high variability, which is typical of egocentric video recorded in kitchens.
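The unsupervised domain adaptation setting described above can be illustrated with a small sketch: clips from the earlier collection period act as a labelled source domain, and later clips act as an unlabelled target domain. Everything here is hypothetical (the `Clip` record, its fields, the `make_uda_split` helper, and the example data); the official split is defined by the dataset's own annotation files and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Clip:
    video_id: str          # hypothetical identifier
    year: int              # year the footage was recorded
    label: Optional[str]   # action label, or None when hidden


def make_uda_split(clips: List[Clip], source_year: int = 2018) -> Tuple[List[Clip], List[Clip]]:
    """Split clips into a labelled source set and an unlabelled target set.

    Assumption: the recording year alone decides the domain. Target labels
    are hidden, as the unsupervised setting requires.
    """
    source = [c for c in clips if c.year == source_year]
    target = [Clip(c.video_id, c.year, None) for c in clips if c.year != source_year]
    return source, target


# Made-up example records, for illustration only.
clips = [
    Clip("a", 2018, "open fridge"),
    Clip("b", 2020, "wash pan"),
    Clip("c", 2018, "cut onion"),
]
source, target = make_uda_split(clips)
print(len(source), "labelled source clips;", len(target), "unlabelled target clips")
```

A model would then be trained on `source` with labels, adapted using `target` without labels, and evaluated on held-out target footage whose labels are kept aside.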

Implications and Future Directions

The enhancements in the EPIC-KITCHENS-100 dataset contribute significantly to the field of computer vision, providing a framework through which models can be trained and adapted to dynamic, real-world environments. The introduction of the "pause-and-talk" approach may redefine standard practices in video data annotation, emphasizing the need for cognitive load considerations.

Additionally, the domain adaptation challenges embedded within this dataset serve as a testing ground for validating how well models generalise across temporally and contextually diverse environments. As action anticipation methods for egocentric video mature, the dataset's complexity should support the development of models that better understand human-object interactions in varying contexts.

Future research could explore deeper integration of multimodal data analysis to further combat domain gaps and enhance action anticipation accuracy. Extended datasets could incorporate more diverse environments and participants, enriching the empirical sampling and alleviating biases inherent in domain-specific datasets.

In conclusion, the enhancements made in the EPIC-KITCHENS-100 dataset not only refine existing data collection methodologies but also pose substantial challenges to the current state of domain adaptation techniques within egocentric vision research. This lays the groundwork for subsequent advances in the development of robust, adaptive action recognition models.

GitHub

epic-kitchens/epic-kitchens-100-narrator: Video narrator written in Python/GTK using vlc-lib