- The paper presents a novel method that transforms 3D skeleton trajectories into color-encoded 2D Joint Trajectory Maps for effective action recognition.
- It fine-tunes ImageNet-pre-trained ConvNet models on three orthogonal projections of the joint trajectories and combines their outputs with a multiply score fusion scheme to improve classification performance.
- The approach achieves up to 81.08% accuracy on NTU RGB+D, outperforming traditional RNN and LSTM methods in cross-subject and cross-view scenarios.
Action Recognition Based on Joint Trajectory Maps with Convolutional Neural Networks
The paper presents an approach to human action recognition that applies Convolutional Neural Networks (ConvNets) to sequence-based data, specifically 3D skeleton sequences. It does so by transforming the temporal dynamics and spatial configuration of joint trajectories into color-encoded 2D representations called Joint Trajectory Maps (JTMs). The JTMs are generated by projecting each skeleton sequence onto three orthogonal planes, each offering complementary information, so that existing ConvNet models can be fine-tuned on skeleton data rather than trained from scratch.
Methodology
The approach introduced in this work involves three main steps:
- JTM Construction: The sequence of joint trajectories is rendered as three distinct 2D images by projecting the 3D trajectories onto three orthogonal planes. Motion dynamics are encoded with color: hue indicates the direction of motion, while saturation and brightness encode its magnitude. Different colormaps are applied to distinguish body parts, enriching the spatial-temporal encoding (see the first sketch after this list).
- ConvNets Training: The JTMs from the three planes serve as input to three separate ConvNets, each learning discriminative features for its projection. The networks are fine-tuned from architectures pre-trained on ImageNet, which substantially reduces training cost compared with training from scratch (see the second sketch after this list).
- Score Fusion: To enhance recognition accuracy, a multiply score fusion step combines the class scores of the three ConvNets into a final classification decision, capitalizing on the complementary nature of the three JTMs (see the third sketch after this list).
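A minimal NumPy/Matplotlib sketch of the JTM idea follows. It is a simplification under stated assumptions: the function name and the per-point rendering are our own, the paper draws full trajectory segments and assigns different colormaps to different body parts, and the exact hue/saturation/brightness mapping here is only an approximation of the encoding described above.

```python
import numpy as np
from matplotlib import cm

def joint_trajectory_map(joints, plane=(0, 1), size=256, cmap=cm.hsv):
    """Render one JTM by projecting 3D joint trajectories onto a plane.

    joints: (T, J, 3) array -- T frames, J joints, xyz coordinates.
    plane:  pair of axis indices selecting the projection, e.g.
            (0, 1) front view, (0, 2) top view, (1, 2) side view.
    """
    pts = joints[..., list(plane)]                        # (T, J, 2)
    lo = pts.min(axis=(0, 1))
    hi = pts.max(axis=(0, 1))
    pix = ((pts - lo) / (hi - lo + 1e-8) * (size - 1)).astype(int)

    speed = np.linalg.norm(np.diff(pts, axis=0), axis=2)  # (T-1, J)
    vmax = speed.max() + 1e-8

    canvas = np.zeros((size, size, 3), dtype=np.float32)
    for t in range(1, len(pix)):
        for j in range(pix.shape[1]):
            dx, dy = pts[t, j] - pts[t - 1, j]
            # Direction of motion mapped to hue in [0, 1).
            hue = (np.arctan2(dy, dx) / (2 * np.pi)) % 1.0
            # Scale the color by speed as a stand-in for the paper's
            # saturation/brightness encoding of motion magnitude.
            rgb = np.array(cmap(hue)[:3]) * (speed[t - 1, j] / vmax)
            x, y = pix[t, j]
            canvas[y, x] = np.maximum(canvas[y, x], rgb)
    return (canvas * 255).astype(np.uint8)
```

Calling this with `plane=(0, 1)`, `(0, 2)`, and `(1, 2)` yields the three orthogonal JTMs for one skeleton sequence.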
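The fine-tuning step can be sketched as follows, using PyTorch/torchvision purely for illustration (the paper's own training pipeline is not reproduced here). AlexNet is used as a representative ImageNet-pre-trained backbone, and `num_classes=60` reflects NTU RGB+D; only the final classifier layer is replaced.

```python
import torch.nn as nn
from torchvision import models

def build_jtm_net(num_classes=60):
    """Fine-tuning setup: swap the ImageNet classifier head for an
    action-class head; all earlier weights are reused as-is."""
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    net.classifier[6] = nn.Linear(net.classifier[6].in_features, num_classes)
    return net

# One network per orthogonal projection, each trained on its own JTMs.
nets = [build_jtm_net(num_classes=60) for _ in range(3)]
```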
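Multiply score fusion itself reduces to an element-wise product over the per-network class scores; a minimal sketch, assuming each network outputs softmax posteriors of shape `(num_samples, num_classes)`:

```python
import numpy as np

def multiply_score_fusion(score_list):
    """Fuse per-ConvNet class scores by element-wise multiplication.

    score_list: list of (num_samples, num_classes) arrays of softmax
    scores, one array per projection/ConvNet.
    """
    fused = np.ones_like(score_list[0])
    for scores in score_list:
        fused = fused * scores
    return fused.argmax(axis=1)  # predicted class index per sample
```

Because the scores are multiplied, a class must score consistently high across all three projections to win, which is how the fusion exploits their complementarity.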
Numerical Results and Claims
The method is evaluated on four widely used public datasets and achieves state-of-the-art results in several cases. Notably, it reaches accuracies of 76.32% (cross-subject) and 81.08% (cross-view) on the NTU RGB+D dataset, outperforming previous techniques such as Part-aware LSTM and ST-LSTM with Trust Gate. The substantial margin over RNN-based architectures highlights the robustness of pairing image-based representations with ConvNets.
Implications and Future Prospects
Practically, the proposed method simplifies the application of deep learning to 3D skeleton data by recasting the problem as image classification, making better use of pre-trained ConvNet models. Theoretically, it introduces a new way to encode dynamic joint information in static representations, improving spatio-temporal feature extraction.
Looking ahead, this approach could be extended to online action recognition, and further data augmentation may strengthen cross-view recognition. Future work could optimize the selection of orthogonal planes to minimize self-occlusion and adapt JTMs to sensor technologies beyond the current depth-camera setup. As the field progresses, the method may also be integrated with other data modalities in multi-modal systems for more comprehensive action recognition solutions.