The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

Published 29 Apr 2020 in cs.CV | (2005.00343v1)

Abstract: Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict nonscripted daily activities, as recording is started every time a participant entered their kitchen. Recording took place in 4 countries by participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos after recording, thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions e.g. 'closing a tap' from 'opening' it up.

Authors (11)

Citations (195)

Summary

The paper introduces the EPIC-KITCHENS dataset as a large-scale first-person video benchmark enriched with narrated annotations to capture natural human-object interactions.
The paper details robust methodologies for action recognition, anticipation, and object detection, presenting baselines with TSN and Faster R-CNN that highlight current performance challenges.
The paper underscores challenges in predicting actions in unscripted kitchen environments, emphasizing the need for enhanced temporal reasoning and effective multimodal fusion.

An Overview of the EPIC-KITCHENS Dataset: Collection, Challenges, and Baselines

The EPIC-KITCHENS dataset contributes the largest benchmark of first-person video recordings to the field of egocentric vision. Captured in naturalistic settings, the dataset supports the analysis of human-object interactions, intention recognition, and anticipatory modeling. The paper details how the dataset was collected and annotated, describes the challenges defined on it, and reports performance baselines for several key computer vision tasks.

Key Aspects of the Dataset

EPIC-KITCHENS comprises 55 hours of video across 11.5 million frames, recorded by 32 participants in their native kitchen environments in four countries. The data collection procedure was designed to capture unscripted, natural interactions, reflecting the diversity of cooking habits influenced by geographical and cultural backgrounds. Notably, the dataset includes 39.6K annotated action segments and 454.2K object bounding boxes.
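To give a feel for the annotation density these figures imply, the short sketch below derives a few averages from the numbers quoted above. The function name `dataset_rates` and the choice of ratios are illustrative assumptions, not quantities reported by the paper. In particular, the boxes-per-segment value is a plain ratio of two totals and does not imply that boxes were annotated segment by segment.

```python
# Totals quoted in the text above; the derived ratios are illustrative only.
HOURS = 55
FRAMES = 11.5e6
ACTION_SEGMENTS = 39.6e3
BOUNDING_BOXES = 454.2e3


def dataset_rates(hours, frames, segments, boxes):
    """Return simple averages derived from the dataset totals."""
    seconds = hours * 3600
    return {
        "frames_per_second": frames / seconds,    # average capture rate
        "segments_per_hour": segments / hours,    # action annotation density
        "boxes_per_segment": boxes / segments,    # ratio of the two totals
    }


rates = dataset_rates(HOURS, FRAMES, ACTION_SEGMENTS, BOUNDING_BOXES)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Under these totals the averages come out at roughly 58 frames per second and 720 action segments per hour, that is, about one labelled action every five seconds of video.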

An innovative aspect of the dataset is the inclusion of participant-narrated annotations. Participants provided verbal descriptions of their activities post-recording, which were subsequently transcribed and used to generate ground-truth labels. This narrative approach captures genuine intention and contextualizes actions within the recordings.

Challenges and Baseline Evaluations

The paper introduces several computational challenges using the dataset: action recognition, action anticipation, and object detection. The challenges are structured to evaluate performance in seen and unseen kitchen environments, emphasizing the adaptability of models to previously unobserved contexts.

Action Recognition: This challenge involves classifying verb-noun pairings derived from the annotated sequences. The baselines use Temporal Segment Networks (TSN) and explore different modalities: RGB, optical flow, and audio. The findings indicate that explicit temporal modeling and multimodal fusion improve action recognition accuracy, although substantial room for improvement remains.
Action Anticipation: For anticipatory modeling, models need to predict an action before it starts. Methods evaluated include encoder-decoder architectures and deep multimodal regression strategies, revealing the complexities in predicting future actions within egocentric settings.
Object Detection: Utilizing Faster R-CNN, this challenge benchmarks the ability to identify and localize objects with varying frequencies of occurrence. Findings indicate significant challenges, especially in detecting infrequent or small-sized objects, pointing towards future research directions in fine-tuning object detection from limited egocentric data.
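Because an action is treated as a verb-noun pairing, a prediction can be right on the verb, on the noun, or on both. The sketch below shows one way to score these three cases separately. It is a minimal illustration: the function name `evaluate`, the tuple representation, and the toy labels are assumptions made here, not the paper's evaluation code.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str]  # (verb, noun); hypothetical representation


def evaluate(preds: List[Action], truths: List[Action]) -> Dict[str, float]:
    """Top-1 accuracy for verbs, nouns, and full (verb, noun) actions."""
    if not truths or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and equal in length")
    n = len(truths)
    pairs = list(zip(preds, truths))
    return {
        "verb": sum(p[0] == t[0] for p, t in pairs) / n,
        "noun": sum(p[1] == t[1] for p, t in pairs) / n,
        # An action counts as correct only if verb and noun are both correct.
        "action": sum(p == t for p, t in pairs) / n,
    }


# Toy labels, invented for illustration.
truths = [("open", "tap"), ("close", "tap"), ("cut", "onion"), ("wash", "pan")]
preds = [("open", "tap"), ("open", "tap"), ("cut", "tomato"), ("wash", "pan")]

scores = evaluate(preds, truths)
print(scores)  # {'verb': 0.75, 'noun': 0.75, 'action': 0.5}
```

The second toy prediction confuses 'close tap' with 'open tap', the kind of fine-grained error the abstract attributes to weak temporal modelling: the noun is correct, yet the action is not. The same function could be run once on the seen-kitchens split and once on the unseen-kitchens split to compare the two.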

Implications and Future Directions

The EPIC-KITCHENS dataset opens avenues for advancing state-of-the-art models in egocentric vision. Its scale and diversity allow for the development and testing of algorithms capable of understanding complex human behaviors in natural settings, a crucial requirement for real-world applications in robotics, assistive technologies, and smart environment interfaces.

Future work should focus on improving temporal reasoning to better model and anticipate human actions, potentially integrating richer semantic understanding through unsupervised learning. Routine modeling and skill analysis from prolonged video sequences also stand out as promising research areas.

In conclusion, EPIC-KITCHENS not only provides a robust benchmark for existing challenges but also stimulates research into new paradigms of video understanding that align closely with human cognitive processes, ultimately bridging the gap between machine perception and human intentionality.
