Skeleton-based Action Recognition Using LSTM and CNN (1707.02356v1)

Published 6 Jul 2017 in cs.CV

Abstract: Recent methods based on 3D skeleton data have achieved outstanding performance due to its conciseness, robustness, and view-independent representation. With the development of deep learning, Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM)-based learning methods have achieved promising performance for action recognition. However, for CNN-based methods, it is inevitable to lose temporal information when a sequence is encoded into images. In order to capture as much spatial-temporal information as possible, LSTM and CNN are adopted to conduct effective recognition with later score fusion. In addition, experimental results show that the score fusion between CNN and LSTM performs better than that between LSTM and LSTM for the same feature. Our method achieved state-of-the-art results on NTU RGB+D datasets for 3D human action analysis. The proposed method achieved 87.40% in terms of accuracy and ranked $1^{st}$ place in Large Scale 3D Human Activity Analysis Challenge in Depth Videos.

Summary

The paper proposes a dual-network framework integrating LSTM and CNN with a novel score fusion method for improved skeleton-based human action recognition.
The method achieved high accuracy rates of 82.89% (cross-subject) and 90.10% (cross-view) on the NTU RGB+D dataset, outperforming prior models.
Score fusion, particularly the multiply method, proved effective for combining network outputs, showing potential for applications like surveillance and human-computer interaction.

Overview of Skeleton-Based Action Recognition Using LSTM and CNN

Skeleton-based human action recognition has become an important area of computer vision research because skeleton data remains reliable where RGB data falls short, for example under illumination changes or viewpoint variation. This paper combines Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM) networks to capture as much temporal and spatial information as possible for 3D human action recognition from skeleton data.

Methodological Foundation

The authors propose a dual-network approach in which LSTM networks model temporal dependencies and CNNs learn spatial context. The central idea is late score fusion: each network produces its own class scores, and the scores are then combined to make the final prediction. Because encoding a sequence into an image inevitably discards some temporal information, the LSTM supplies the temporal modeling that the CNN branch lacks, while the CNN contributes spatial detail. The authors also report that fusing a CNN with an LSTM works better than fusing two LSTMs on the same feature.
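To make the two input views concrete, here is a minimal sketch of how one skeleton clip could be prepared for each branch. The image layout (joints as rows, frames as columns, x/y/z as color channels), the min-max scaling, and the name `skeleton_to_image` are assumptions for illustration; the summary does not specify the encoding the authors used.

```python
import numpy as np

def skeleton_to_image(seq):
    """Encode a (frames, joints, 3) coordinate array as a uint8 image.

    Hypothetical layout: rows are joints, columns are frames, and the
    three channels hold the x, y, z coordinates.
    """
    seq = np.asarray(seq, dtype=np.float64)
    lo = seq.min(axis=(0, 1), keepdims=True)   # per-coordinate minimum over the clip
    hi = seq.max(axis=(0, 1), keepdims=True)   # per-coordinate maximum over the clip
    scale = np.where(hi > lo, hi - lo, 1.0)    # guard against a constant coordinate
    norm = (seq - lo) / scale                  # values in [0, 1]
    img = np.rint(norm * 255).astype(np.uint8)
    return img.transpose(1, 0, 2)              # (joints, frames, 3)

rng = np.random.default_rng(0)
clip = rng.normal(size=(40, 25, 3))            # toy clip: 40 frames, 25 joints

image = skeleton_to_image(clip)                # CNN-style input
sequence = clip.reshape(clip.shape[0], -1)     # LSTM-style input: one 75-d vector per frame
```

In this sketch the image quantizes the coordinates and turns time into a spatial axis, which is one way to see why an image encoding can lose temporal information that a recurrent branch still sees frame by frame.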

Experimental Results

The method was evaluated on the NTU RGB+D dataset, a standard benchmark for 3D human action analysis that covers diverse action classes recorded from several viewpoints with many different subjects. The proposed method achieved an accuracy of 82.89% in the cross-subject setting and 90.10% in the cross-view setting, surpassing previous models, including deep hierarchical RNNs and other convolutional models. Among score-level fusion rules, multiply-score fusion performed better than max-score and average-score fusion.
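The two evaluation settings differ only in how samples are assigned to training and test sets. The sketch below assumes the dataset's usual sample naming (`S…C…P…R…A…` for setup, camera, performer, replication, action) and the commonly cited protocol lists; the subject and camera IDs are recalled from the dataset's published protocol rather than taken from this summary, so check them against the dataset documentation before use. The function name `split_of` is hypothetical.

```python
import re

# Assumption: IDs as commonly cited for the NTU RGB+D protocol; verify before use.
TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19,
                  25, 27, 28, 31, 34, 35, 38}
TRAIN_CAMERAS = {2, 3}

NAME = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

def split_of(sample_name, protocol):
    """Return 'train' or 'test' for one sample under the given protocol."""
    match = NAME.search(sample_name)
    if match is None:
        raise ValueError(f"unrecognized sample name: {sample_name!r}")
    setup, camera, performer, replication, action = map(int, match.groups())
    if protocol == "cross_subject":
        # Split by who performs the action.
        return "train" if performer in TRAIN_SUBJECTS else "test"
    if protocol == "cross_view":
        # Split by which camera recorded the action.
        return "train" if camera in TRAIN_CAMERAS else "test"
    raise ValueError(f"unknown protocol: {protocol!r}")
```

Under these assumptions, the cross-subject score measures generalization to unseen people, while the cross-view score measures generalization to an unseen camera angle.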

Key Insights and Contrasts

A notable finding is that multiply-score fusion outperformed concatenation of feature vectors, which suggests that a more complex aggregation scheme does not by itself improve accuracy. The method also proved effective in practice: in the Large Scale 3D Human Activity Analysis Challenge in Depth Videos it reached 87.40% accuracy and ranked first among the competing methods.
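The score-level fusion rules named in this summary (multiply, average, max) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: each branch is assumed to output per-class probabilities, the function name `fuse` is hypothetical, and the numbers are a hand-picked toy example rather than results from the paper.

```python
import numpy as np

def fuse(p_a, p_b, rule):
    """Combine two (samples, classes) score arrays and return predicted classes."""
    if rule == "multiply":
        fused = p_a * p_b              # element-wise product of class scores
    elif rule == "average":
        fused = (p_a + p_b) / 2.0      # element-wise mean
    elif rule == "max":
        fused = np.maximum(p_a, p_b)   # element-wise maximum
    else:
        raise ValueError(f"unknown fusion rule: {rule!r}")
    return fused.argmax(axis=-1)

# Toy scores for one sample and three classes (hypothetical values).
lstm_scores = np.array([[0.70, 0.25, 0.05]])
cnn_scores = np.array([[0.02, 0.40, 0.58]])

predictions = {rule: int(fuse(lstm_scores, cnn_scores, rule)[0])
               for rule in ("multiply", "average", "max")}
```

In this toy case the product favors class 1, the only class both branches consider plausible, whereas averaging and taking the maximum follow the single most confident branch and pick class 0. The product penalizes any class that either branch nearly rules out, which is one intuition for why multiplying scores can behave differently from the other two rules.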

Future Directions and Implications

The methodological framework introduced opens avenues for leveraging the capabilities of data-driven models in multimodal fusion applications. Future research could explore adapting this dual-network approach to other forms of sequence data and further refining the types of input features considered, potentially integrating additional modalities like embodied semantics or gesture timing for enriched action recognition.

The implications of this research extend to real-world applications, such as intelligent surveillance systems, advanced human-computer interaction interfaces, and ergonomic assessments in workplace environments. The combination of LSTM and CNN shows that temporal sequence modeling and spatial feature extraction can complement each other, setting a precedent for further developments in AI-driven action recognition systems.

In summary, the paper presents a comprehensive framework for skeleton-based human action recognition. Its demonstrable successes in handling large-scale data and improving recognition accuracy underpin its relevance for current and future explorations in computational human behavior analysis.