- The paper presents a novel method that transforms 3D skeleton trajectories into color-encoded 2D Joint Trajectory Maps for effective action recognition.
- It fine-tunes ImageNet-pre-trained ConvNet models on three orthogonal projections of the joint trajectories and combines their outputs with a multiply score fusion scheme to improve classification performance.
- The approach achieves up to 81.08% accuracy on NTU RGB+D, outperforming traditional RNN and LSTM methods in cross-subject and cross-view scenarios.
Action Recognition Based on Joint Trajectory Maps with Convolutional Neural Networks
The paper presents an approach to human action recognition that applies Convolutional Neural Networks (ConvNets) to sequence-based data, specifically 3D skeleton sequences. It does so by transforming the temporal dynamics and spatial configuration of joint trajectories into color-encoded 2D representations called Joint Trajectory Maps (JTMs). The JTMs are generated by projecting each skeleton sequence onto three orthogonal planes, each offering complementary information, so that existing ConvNet models can be fine-tuned on skeleton data rather than trained from scratch.
Methodology
The approach introduced in this work involves three main steps:
- JTM Construction: The sequence of joint trajectories is rendered as three distinct 2D images by projecting the 3D trajectories onto three orthogonal planes. Motion dynamics are encoded with color: hue indicates the direction of motion, while saturation and brightness encode its magnitude. Different colormaps are applied to distinguish body parts, enriching the spatial-temporal encoding (see the first sketch after this list).
- ConvNets Training: The JTMs from the three planes serve as input to three separate ConvNets, each learning discriminative features for its projection. The networks are fine-tuned from architectures pre-trained on ImageNet, which substantially reduces training cost compared with training from scratch (see the second sketch after this list).
- Score Fusion: To enhance recognition accuracy, a multiply score fusion step combines the class scores of the three ConvNets into a final classification decision, capitalizing on the complementary nature of the three JTMs (see the third sketch after this list).
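A minimal NumPy/Matplotlib sketch of the JTM idea follows. It is a simplification under stated assumptions: the function name and the per-point rendering are our own, the paper draws full trajectory segments and assigns different colormaps to different body parts, and the exact hue/saturation/brightness mapping here is only an approximation of the encoding described above.

```python
import numpy as np
from matplotlib import cm

def joint_trajectory_map(joints, plane=(0, 1), size=256, cmap=cm.hsv):
    """Render one JTM by projecting 3D joint trajectories onto a plane.

    joints: (T, J, 3) array -- T frames, J joints, xyz coordinates.
    plane:  pair of axis indices selecting the projection, e.g.
            (0, 1) front view, (0, 2) top view, (1, 2) side view.
    """
    pts = joints[..., list(plane)]                        # (T, J, 2)
    lo = pts.min(axis=(0, 1))
    hi = pts.max(axis=(0, 1))
    pix = ((pts - lo) / (hi - lo + 1e-8) * (size - 1)).astype(int)

    speed = np.linalg.norm(np.diff(pts, axis=0), axis=2)  # (T-1, J)
    vmax = speed.max() + 1e-8

    canvas = np.zeros((size, size, 3), dtype=np.float32)
    for t in range(1, len(pix)):
        for j in range(pix.shape[1]):
            dx, dy = pts[t, j] - pts[t - 1, j]
            # Direction of motion mapped to hue in [0, 1).
            hue = (np.arctan2(dy, dx) / (2 * np.pi)) % 1.0
            # Scale the color by speed as a stand-in for the paper's
            # saturation/brightness encoding of motion magnitude.
            rgb = np.array(cmap(hue)[:3]) * (speed[t - 1, j] / vmax)
            x, y = pix[t, j]
            canvas[y, x] = np.maximum(canvas[y, x], rgb)
    return (canvas * 255).astype(np.uint8)
```

Calling this with `plane=(0, 1)`, `(0, 2)`, and `(1, 2)` yields the three orthogonal JTMs for one skeleton sequence.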
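The fine-tuning step can be sketched as follows, using PyTorch/torchvision purely for illustration (the paper's own training pipeline is not reproduced here). AlexNet is used as a representative ImageNet-pre-trained backbone, and `num_classes=60` reflects NTU RGB+D; only the final classifier layer is replaced.

```python
import torch.nn as nn
from torchvision import models

def build_jtm_net(num_classes=60):
    """Fine-tuning setup: swap the ImageNet classifier head for an
    action-class head; all earlier weights are reused as-is."""
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    net.classifier[6] = nn.Linear(net.classifier[6].in_features, num_classes)
    return net

# One network per orthogonal projection, each trained on its own JTMs.
nets = [build_jtm_net(num_classes=60) for _ in range(3)]
```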
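Multiply score fusion itself reduces to an element-wise product over the per-network class scores; a minimal sketch, assuming each network outputs softmax posteriors of shape `(num_samples, num_classes)`:

```python
import numpy as np

def multiply_score_fusion(score_list):
    """Fuse per-ConvNet class scores by element-wise multiplication.

    score_list: list of (num_samples, num_classes) arrays of softmax
    scores, one array per projection/ConvNet.
    """
    fused = np.ones_like(score_list[0])
    for scores in score_list:
        fused = fused * scores
    return fused.argmax(axis=1)  # predicted class index per sample
```

Because the scores are multiplied, a class must score consistently high across all three projections to win, which is how the fusion exploits their complementarity.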
Numerical Results and Claims
The method is evaluated on four widely used public datasets and achieves state-of-the-art results in several cases. Notably, it reaches accuracies of 76.32% (cross-subject) and 81.08% (cross-view) on the NTU RGB+D dataset, outperforming previous techniques such as Part-aware LSTM and ST-LSTM with Trust Gate. The substantial margin over RNN-based architectures highlights the robustness of pairing image-based representations with ConvNets.
Implications and Future Prospects
Practically, the proposed method simplifies the application of deep learning to 3D skeleton data by recasting the problem as image classification, making better use of pre-trained ConvNet models. Theoretically, it introduces a new way to encode dynamic joint information in static representations, improving spatio-temporal feature extraction.
Looking ahead, this approach could be extended to online action recognition, and further data augmentation may strengthen cross-view recognition. Future work could optimize the selection of orthogonal planes to minimize self-occlusion and adapt JTMs to sensor technologies beyond the current depth-camera setup. As the field progresses, the method may also be integrated with other data modalities in multi-modal systems for more comprehensive action recognition solutions.