- The paper presents UCF101, a benchmark dataset of 13,320 video clips covering 101 human action classes.
- It establishes a Bag of Words baseline using the Harris3D detector with HOG/HOF descriptors, achieving 44.5% accuracy under a 25-fold cross-validation protocol.
- The videos exhibit real-world challenges such as dynamic backgrounds and camera motion, making the dataset a realistic testbed for robust action recognition research.
Overview of the "UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild" (1212.0402)
The paper presents UCF101, the largest dataset for human action recognition in videos available at the time of its publication. It addresses significant limitations of earlier action recognition datasets by offering a comprehensive, diverse, and realistic collection of YouTube videos, and its substantial size and unconstrained conditions make it a pivotal benchmark for action recognition models.
Dataset Composition and Characteristics
The UCF101 dataset comprises 13,320 video clips spanning 101 human action classes, roughly doubling the number of classes in the largest prior datasets, HMDB51 (51 classes) and UCF50 (50 classes). The authors categorize the actions into five types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, and Sports. This collection offers more diversity and realism than earlier datasets, which often contained a small number of staged or professionally filmed scenes and therefore lacked the complexity of real-world dynamics.
Figure 1: Sample frames for 6 action classes of UCF101.
Figure 2: 101 actions included in UCF101 shown with one sample frame. The color of frame borders specifies to which action type they belong: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, or Sports.
UCF101 stands out for its dynamic backgrounds, diverse camera motions, and varying video qualities. The dataset captures natural variations in lighting, occlusion, and significant camera movement, providing a robust platform for developing more generalized learning algorithms.
Experimental Framework
The authors establish baseline results with the popular Bag of Words (BoW) approach. Space-time interest points are detected with the Harris3D detector and described with HOG/HOF descriptors, both part of the standard toolkit for spatiotemporal feature extraction. The descriptor space is quantized into a dictionary of 4,000 visual words by running k-means clustering on a sample of 100,000 space-time interest points, and each clip's resulting histogram of visual words is classified with a non-linear multiclass SVM using a histogram intersection kernel. To ensure uniformity across future evaluations on UCF101, the paper recommends a 25-fold cross-validation experimental setup.
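For concreteness, here is a minimal sketch of the quantization and classification stages in Python with NumPy and scikit-learn. It assumes HOG/HOF descriptors have already been extracted at Harris3D interest points with an external tool (e.g., Laptev's STIP binaries); the function names and data layout are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import SVC

def histogram_intersection(X, Y):
    """Gram matrix of the histogram intersection kernel K(x, y) = sum_i min(x_i, y_i)."""
    # Simple O(n^2) loop for clarity; vectorize or chunk for large training sets.
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

def build_vocabulary(all_descriptors, n_words=4000, sample_size=100_000, seed=0):
    """Cluster a random sample of descriptors into a visual vocabulary with k-means.

    all_descriptors: (N, d) array of HOG/HOF descriptors pooled over all clips.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(all_descriptors),
                     size=min(sample_size, len(all_descriptors)), replace=False)
    return MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(all_descriptors[idx])

def encode_clip(clip_descriptors, vocabulary, n_words=4000):
    """Represent one clip as an L1-normalized bag-of-words histogram."""
    words = vocabulary.predict(clip_descriptors)
    hist = np.bincount(words, minlength=n_words).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

# X_train: (n_clips, 4000) stacked clip histograms; y_train: integer action labels.
# SVC accepts any callable kernel that returns the Gram matrix.
# clf = SVC(kernel=histogram_intersection).fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```

The histogram intersection kernel is a natural match for L1-normalized histograms, which is presumably why the baseline uses it; scikit-learn's `SVC` handles the multiclass case with a one-vs-one scheme by default.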
Figure 3: Number of clips per action class. The distribution of clip durations is illustrated by the colors.
Figure 4: Total time of videos for each class is illustrated using the blue bars. The average length of the clips for each action is depicted in green.
Figure 5: Confusion table of baseline action recognition results using bag of words approach on UCF101.
This 101-class SVM with a histogram intersection kernel yielded an overall accuracy of 44.5%. Sports categories achieved the highest accuracy at 50.54%, which the authors attribute to the distinctive nature of sports movements, making them easier to classify than other action types. The paper highlights the challenges posed by the dataset's unconstrained nature, including camera motion, background clutter, and occlusion, which align closely with real-world settings.
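As a sketch of how these summary numbers relate to the confusion table in Figure 5, the snippet below computes overall accuracy, per-action-type accuracy, and the confusion matrix from predictions accumulated over the held-out cross-validation folds. The `class_to_type` mapping from each of the 101 classes to one of the five action types is a hypothetical helper, not something shipped with the dataset files.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def summarize(y_true, y_pred, class_to_type):
    """Overall accuracy, accuracy within each action type, and the 101x101 confusion table."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    overall = accuracy_score(y_true, y_pred)               # paper reports 0.445 overall
    types = np.array([class_to_type[c] for c in y_true])
    by_type = {t: accuracy_score(y_true[types == t], y_pred[types == t])
               for t in np.unique(types)}                  # e.g. Sports at 0.5054
    cm = confusion_matrix(y_true, y_pred, labels=np.arange(101))  # Figure 5 analogue
    return overall, by_type, cm
```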
The paper also compares UCF101 against previous action recognition datasets in terms of the number of actions, number of clips, and other characteristics. UCF101 surpasses the existing datasets in both the number of action categories and the volume of clips, making it a valuable resource for advancing the state of the art in action recognition.
Conclusion
"UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild" (1212.0402) is a significant contribution to the action recognition domain. With its wide array of action classes and substantial volume of footage, UCF101 provides an essential benchmark for researchers developing and optimizing robust action recognition algorithms under real-world conditions. The modest baseline results underscore the difficulty of real-world data and point to the need for further advances in action recognition methodology. The dataset opens pathways for future work on complexities such as camera motion, dynamic backgrounds, and intra-class variability.