
Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep cnn (1704.05645v2)

Published 19 Apr 2017 in cs.CV

Abstract: This paper presents an image classification based approach to the skeleton-based video action recognition problem. First, a dataset-independent translation-scale invariant image mapping method is proposed, which transforms skeleton videos into colour images, named skeleton-images. Second, a multi-scale deep convolutional neural network (CNN) architecture is proposed, which can be built and fine-tuned on powerful pre-trained CNNs, e.g., AlexNet, VGGNet, and ResNet. Even though the skeleton-images are very different from natural images, the fine-tuning strategy still works well. Finally, we show that our method also works well on 2D skeleton video data. We achieve state-of-the-art results on popular benchmark datasets, e.g., NTU RGB+D, UTD-MHAD, MSRC-12, and G3D. Especially on the largest and most challenging NTU RGB+D, UTD-MHAD, and MSRC-12 datasets, our method outperforms other methods by a large margin, which demonstrates the efficacy of the proposed method.

Citations (226)

Summary

  • The paper introduces a translation-scale invariant mapping technique that converts 3D skeleton data into color images for robust action recognition.
  • The method fine-tunes multi-scale deep CNN architectures built on pre-trained networks such as AlexNet, VGGNet, and ResNet, achieving accuracy improvements of over 12% (cross-subject) and 11% (cross-view) on NTU RGB+D.
  • The approach generalizes to both 3D and 2D skeleton data, offering promising applications in surveillance and human-computer interaction.

Analysis of "Skeleton Based Action Recognition Using Translation-Scale Invariant Image Mapping And Multi-Scale Deep CNN"

The paper "Skeleton Based Action Recognition Using Translation-Scale Invariant Image Mapping And Multi-Scale Deep CNN" addresses the challenging problem of skeleton-based video action recognition, which is pivotal in applications like human-computer interaction and video surveillance. The authors' work is grounded in transforming 3D skeleton data into a format suitable for image classification via deep convolutional neural networks (CNNs).

The authors put forward a novel image mapping technique that is translation-scale invariant, encoding 3D skeleton videos as colour images termed "skeleton-images." This approach counters a limitation of previous dataset-dependent methods, which failed to preserve translation and scale invariance when mapping from 3D skeleton videos. By normalising each 3D skeleton sequence with respect to its own coordinate range rather than statistics of the entire dataset, the proposed mapping makes recognition invariant to the subject's position and scale within the video.
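To make the mapping concrete, here is a minimal NumPy sketch (not the authors' code), assuming per-sequence min/max normalisation over all frames and joints, with the three coordinate axes written to the R, G, B channels; the paper's exact row/column layout and colour assignment may differ.

```python
import numpy as np

def skeleton_to_image(seq):
    """Map one skeleton sequence to a colour 'skeleton-image'.

    seq: float array of shape (T, J, 3) -- T frames, J joints,
         (x, y, z) coordinates per joint.
    Returns a (J, T, 3) uint8 image: joints along the rows, frames
    along the columns, coordinate axes as the R, G, B channels.
    """
    # Min/max are taken over THIS sequence only (all frames and joints),
    # not over the dataset -- this is what removes translation and scale.
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    norm = (seq - lo) / np.maximum(hi - lo, 1e-6)  # -> [0, 1]
    img = np.round(255.0 * norm).astype(np.uint8)  # -> [0, 255]
    return img.transpose(1, 0, 2)                  # rows=joints, cols=frames
```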

Concurrently, the authors propose a multi-scale deep CNN architecture built on pre-trained networks such as AlexNet, VGGNet, and ResNet. These models are fine-tuned on the skeleton-images despite their inherent dissimilarity from the natural images on which the networks were originally trained. This fine-tuning strategy avoids training from scratch, boosting performance particularly in scenarios lacking extensive annotated data.
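As a rough illustration of this fine-tuning recipe, the PyTorch sketch below loads an ImageNet-pretrained ResNet-50 and swaps its classification head for one sized to the action vocabulary. The class count, learning rates, and choice of ResNet-50 are assumptions for illustration; the authors' original implementation likely used a different framework and hyperparameters.

```python
import torch
import torch.nn as nn
from torchvision import models

num_actions = 60  # hypothetical: the 60 action classes of NTU RGB+D

# Start from an ImageNet-pretrained backbone and replace the classifier
# head so its output matches the action vocabulary.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_actions)

# A common fine-tuning recipe: a small learning rate for the pretrained
# backbone, a larger one for the freshly initialised head.
backbone = [p for n, p in model.named_parameters() if not n.startswith("fc")]
optimizer = torch.optim.SGD(
    [{"params": backbone, "lr": 1e-4},
     {"params": model.fc.parameters(), "lr": 1e-3}],
    momentum=0.9,
)
criterion = nn.CrossEntropyLoss()
```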

Empirical results highlight the efficacy of the method across prominent datasets, including NTU RGB+D, UTD-MHAD, MSRC-12, and G3D. Particularly noteworthy is the improvement on NTU RGB+D, where accuracy rises by over 12% in cross-subject evaluation and 11% in cross-view evaluation relative to prior state-of-the-art methods. Such results suggest that the method generalises robustly to the diverse and challenging situations encountered in practical applications.

The multi-scale CNN approach exploits variation across spatial frequencies in the skeleton-images: by using different input sizes corresponding to multiple scales, the model captures a more comprehensive set of features, reinforcing its discriminative power and the robustness of classification.
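One plausible realisation of this multi-scale idea, sketched below under the assumption of late fusion by score averaging, is to train one network per input resolution and combine their softmax outputs at test time; the paper may fuse scales differently (e.g., by weighted voting or feature concatenation).

```python
import torch
import torch.nn.functional as F

def multiscale_scores(models_by_scale, image):
    """Fuse class scores from CNNs trained at different input scales.

    models_by_scale: dict mapping an input side length (e.g. 224, 192,
                     160) to a CNN trained on skeleton-images resized
                     to that resolution.
    image: float tensor of shape (1, 3, H, W), one skeleton-image.
    """
    scores = []
    for size, net in models_by_scale.items():
        net.eval()
        # Each scale sees a differently resampled view of the
        # joint/frame grid.
        x = F.interpolate(image, size=(size, size),
                          mode="bilinear", align_corners=False)
        with torch.no_grad():
            scores.append(F.softmax(net(x), dim=1))
    # Late fusion by averaging per-scale class probabilities
    # (an assumption; the paper may weight or combine differently).
    return torch.stack(scores).mean(dim=0)
```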

Furthermore, the method offers a versatile framework that extends beyond 3D skeleton data, achieving strong results on 2D skeleton data as well. This adaptability suggests applicability in settings where depth sensing hardware is unavailable or impractical.

Looking forward, the presented work might inspire extensions incorporating more advanced data augmentation techniques or integration with other modalities such as audio or ambient sensor data to further elevate action recognition robustness. Additionally, advancements in real-time processing efficiency could broaden its utility in real-world interactive applications. Overall, this research contributes significantly to the body of knowledge in skeleton-based action recognition, offering a meticulously engineered solution that pushes the boundaries of existing methodologies.