FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding (2004.06704v1)

Published 14 Apr 2020 in cs.CV

Abstract: On public benchmarks, current action recognition techniques have achieved great success. However, when used in real-world applications, e.g. sport analysis, which requires the capability of parsing an activity into phases and differentiating between subtly different actions, their performances remain far from being satisfactory. To take action recognition to a new level, we develop FineGym, a new dataset built on top of gymnastic videos. Compared to existing action recognition datasets, FineGym is distinguished in richness, quality, and diversity. In particular, it provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy. For example, a "balance beam" event will be annotated as a sequence of elementary sub-actions derived from five sets: "leap-jump-hop", "beam-turns", "flight-salto", "flight-handspring", and "dismount", where the sub-action in each set will be further annotated with finely defined class labels. This new level of granularity presents significant challenges for action recognition, e.g. how to parse the temporal structures from a coherent action, and how to distinguish between subtly different action classes. We systematically investigate representative methods on this dataset and obtain a number of interesting findings. We hope this dataset could advance research towards action understanding.

Citations (281)

View on Semantic Scholar

Summary

The paper introduces FineGym with a novel hierarchical annotation scheme capturing events, sets, and elements for precise action parsing.
It leverages high-resolution gymnastics competition videos to provide over 32,000 sub-action instances, addressing subtle temporal dynamics.
Empirical results reveal that conventional models struggle with FineGym's fine-grained details, highlighting the need for advanced spatio-temporal methods.

Hierarchical Video Dataset for Fine-grained Action Understanding

The paper "FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding" introduces a novel dataset specifically designed to address the challenges associated with fine-grained action recognition, particularly in the context of gymnastics. Existing datasets predominantly capture coarse-grained action categories which often rely on contextual information like background settings to aid categorization. FineGym, however, is crafted to support detailed action parsing by providing granular annotations at multiple semantic and temporal levels.

FineGym distinguishes itself through its combinatorial hierarchy approach, encapsulating actions at three semantic levels: event, set, and element. This dataset is built upon high-resolution videos from professional gymnastics competitions, ensuring high quality and relevance to real-world applications such as sports analytics. The temporal aspect is equally nuanced, with annotations provided at both the action and sub-action levels. This granularity enables researchers to decode the complex temporal structures underlying gymnastics routines and to differentiate between subtleties of movements.

The dataset construction process reflects the complexity and demands of the task. A multi-faceted approach was implemented that involved strategic data collection from official gymnastics competition recordings, a structured hierarchical categorization scheme drawing from expert knowledge resources, and detailed annotation protocols using decision trees for precision. As a testament to its robustness, FineGym includes over 32,000 sub-action instances across 530 uniquely classified elements.

In terms of empirical evaluation, the paper underscores the limitations of current recognition techniques when subjected to the high-resolution demands of FineGym. For example, methods like TSN, TRN, TSM, I3D, and ST-GCN, which have seen success on coarse-grained datasets, struggle with the nuanced actions in FineGym. Notable challenges include handling intense motion dynamics and discriminating subtle semantic differences between action elements. The authors further investigate common modeling practices such as frame sampling rates and temporal modeling schemes, demonstrating that more frames and advanced temporal reasoning significantly affect fine-grained action recognition performance.

A significant emphasis of the paper is on the requirement for novel methodologies to address these discrepancies. Sparse sampling strategies adequate for datasets like UCF101 fall short on FineGym. Additionally, existing pre-training paradigms, particularly those that use extensive video datasets like Kinetics, might not translate well to the requirements of fine-grained action analysis due to differing action dynamics.

The implications of this research are both practical and theoretical. Practically, FineGym provides a challenging benchmark for evaluating and developing new techniques for action segmentation, recognition, and potentially auto-scoring in sport analytics, given its comprehensive and structured annotations. Theoretically, the dataset identifies significant gaps in current model capabilities, indicating a need for exploration into more sophisticated, possibly multi-modal, approaches that can concurrently handle complex spatio-temporal dynamics.

FineGym thus stands as a pivotal resource in advancing fine-grained action understanding, providing researchers with the means to explore and model the intricacies inherent in athletic movements. Future developments may explore leveraging FineGym for various applications like multi-attribute prediction and model interpretability, utilizing its unique structure and annotations to enhance algorithmic performance across a range of fine-grained tasks in computer vision.