- The paper introduces NTU-X, an enhanced dataset that adds 51 facial and 42 finger joints to improve pose-based recognition of subtle human actions.
- The dataset was curated using the SMPL-X and ExPose pose-estimation models, ensuring high-quality action sequences for robust benchmarking.
- Experimental results demonstrate that incorporating detailed finger joints significantly improves recognition accuracy for intricate actions like typing and writing.
Overview of NTU-X: An Enhanced Dataset for Pose-based Human Action Recognition
This paper introduces NTU-X, an expanded large-scale dataset designed to address the limitations of pose-based human action recognition systems that rely on coarse skeleton data. Standard datasets often lack detailed joint information, especially for the face and hands, creating a substantial bottleneck in accurately recognizing fine-grained human actions. NTU-X enhances skeleton action recognition by providing richer joint data: 51 facial joints and 42 finger joints alongside the standard 25 body joints featured in previous datasets such as NTU RGB+D, for a total of 118 joints per skeleton.
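To make the joint composition concrete, below is a minimal Python sketch of how one NTU-X-style skeleton frame could be organized. The joint ordering, the `split_skeleton` helper, and the array shapes are illustrative assumptions, not the dataset's actual layout; only the joint counts (25 body, 42 finger, 51 face) come from the paper.

```python
import numpy as np

# Illustrative per-frame layout of an NTU-X-style skeleton.
# The ordering (body, then fingers, then face) is an assumption.
NUM_BODY_JOINTS = 25    # standard NTU RGB+D body joints
NUM_FINGER_JOINTS = 42  # 21 per hand
NUM_FACE_JOINTS = 51    # facial landmarks
NUM_JOINTS = NUM_BODY_JOINTS + NUM_FINGER_JOINTS + NUM_FACE_JOINTS  # 118

def split_skeleton(frame: np.ndarray):
    """Split a (118, 3) array of 3D joint coordinates into body,
    finger, and face groups, assuming the ordering above."""
    assert frame.shape == (NUM_JOINTS, 3)
    body = frame[:NUM_BODY_JOINTS]
    fingers = frame[NUM_BODY_JOINTS:NUM_BODY_JOINTS + NUM_FINGER_JOINTS]
    face = frame[NUM_BODY_JOINTS + NUM_FINGER_JOINTS:]
    return body, fingers, face

# Example: one dummy frame of a single-person sequence.
frame = np.random.randn(NUM_JOINTS, 3).astype(np.float32)
body, fingers, face = split_skeleton(frame)
print(body.shape, fingers.shape, face.shape)  # (25, 3) (42, 3) (51, 3)
```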
Key Contributions
- Introduction of the NTU-X Dataset: NTU-X consists of NTU60-X and NTU120-X, which extend the NTU RGB+D datasets with additional facial and finger joints. The dataset, including multi-person sequences, provides richer action representations that are critical for actions involving hand gestures and facial expressions.
- Dataset Curation: The dataset was built by estimating poses from the RGB frames with the SMPL-X and ExPose models, with careful handling of cases where these estimators perform poorly, such as blurred frames and occlusions. This curation ensures a high-quality set of action sequences.
- Dataset Application and Benchmarking: Several state-of-the-art skeleton-based models, adapted to handle the newly added joints, were benchmarked on NTU60-X and NTU120-X. The results show improvements over the corresponding body-only skeleton datasets, particularly for subtle actions involving finger movements.
- Ablation Studies: An analysis of different joint combinations showed that finger joints matter more than face joints for recognition performance. Finger joints contributed notably to resolving ambiguous actions such as typing on a keyboard or writing, while facial joints were less influential; a sketch of how such joint subsets can be sliced follows this list.
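The following minimal Python sketch illustrates how joint-subset ablations of this kind could be carried out by slicing a sequence array. The group index ranges, the `ABLATION_CONFIGS` dictionary, and the array shapes are assumptions for illustration; they are not the paper's code or the dataset's actual joint ordering.

```python
import numpy as np

# Assumed joint-group index ranges (ordering: body, fingers, face).
BODY = list(range(0, 25))
FINGERS = list(range(25, 67))
FACE = list(range(67, 118))

ABLATION_CONFIGS = {
    "body": BODY,
    "body+face": BODY + FACE,
    "body+fingers": BODY + FINGERS,
    "body+fingers+face": BODY + FINGERS + FACE,
}

def select_joints(sequence: np.ndarray, config: str) -> np.ndarray:
    """Keep only the joints of a given ablation configuration.
    `sequence` is assumed to have shape (frames, 118, 3)."""
    return sequence[:, ABLATION_CONFIGS[config], :]

# Example: a dummy 64-frame clip, reduced to each configuration.
seq = np.random.randn(64, 118, 3).astype(np.float32)
for name in ABLATION_CONFIGS:
    print(name, select_joints(seq, name).shape)
```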
Findings and Implications
The introduction of NTU-X shows that existing models achieve better performance when provided with richer, more detailed pose data. In particular, DSTA-Net, one of the benchmarked models, achieved substantial gains and set the state of the art for skeleton-based action recognition on the extended datasets.
Practical Implications
The enhanced dataset has significant practical implications for fields requiring detailed human activity analysis, such as surveillance or interactive applications, where understanding nuanced gestures can lead to better human-computer interaction. Moreover, by improving accuracy on subtle actions, NTU-X provides a valuable resource for developing more responsive and precise multimedia systems.
Theoretical Implications
From a theoretical standpoint, NTU-X encourages further exploration into model architectures capable of leveraging dense skeleton representations. Future developments could involve creating models optimized for the new skeletal topology presented by NTU-X, potentially leading to innovative solutions in the domain of skeleton-based human action recognition.
Potential Directions for AI
This paper suggests several avenues for future research in AI:
- Development of Models for Dense Representations: With NTU-X providing dense spatial data, models that can efficiently track and interpret complex joint configurations are a promising direction.
- Cross-modal Fusion: As NTU-X includes detailed pose information, integrating these representations with other modalities like audio or text can aid in devising comprehensive AI systems for context-aware recognition.
- Facial Joint Utilization: Investigating ways to optimize architectures to fully exploit facial joint data could potentially boost performance for actions involving subtle facial movements or expressions.
In conclusion, NTU-X is a step toward closing critical gaps in current datasets, paving the way for more effective and nuanced human action recognition. It shifts the focus from novel architecture development to improving the underlying data itself, offering a pivotal resource for future advances in human action understanding.