Skeleton-Contrastive 3D Action Representation Learning (2108.03656v1)

Published 8 Aug 2021 in cs.CV

Abstract: This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via a noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations in a cross-contrastive manner. In addition, we contribute several skeleton-specific spatial and temporal augmentations which further encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning similarities between different skeleton representations as well as augmented views of the same sequence, the network is encouraged to learn higher-level semantics of the skeleton data than when only using the augmented views. Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets with multiple downstream tasks, including action recognition, action retrieval and semi-supervised learning. Code is available at https://github.com/fmthoker/skeleton-contrast.

Summary

The paper introduces inter-skeleton contrastive learning to robustly learn invariant features across diverse 3D skeleton representations.
It develops skeleton-specific augmentations like pose augmentation, joint jittering, and temporal crop-resize to enhance noise and viewpoint invariance.
The paper reports state-of-the-art self-supervised performance on benchmarks such as NTU RGB+D 60 and PKU-MMD, reaching up to 85.2% accuracy in cross-view scenarios.

Skeleton-Contrastive 3D Action Representation Learning

The paper "Skeleton-Contrastive 3D Action Representation Learning" presents a self-supervised approach to 3D skeleton-based action recognition. The authors use contrastive learning to build a feature space well suited to recognizing actions from 3D skeleton data, that is, sequences of spatial coordinates of human joints. The central idea is inter-skeleton contrastive learning, which contrasts different skeleton representations of the same sequence so that the learned features capture higher-level semantics than augmented views alone would provide.
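The cross-representation objective can be illustrated with a small sketch. The abstract only states that the method uses "a noise contrastive estimation", so the InfoNCE form below, the function names `info_nce` and `inter_skeleton_loss`, the temperature of 0.07, and the use of explicit negative sets are all assumptions made for illustration; this is not the authors' implementation, which is available in the repository linked in the abstract.

```python
import numpy as np


def info_nce(query, keys_pos, keys_neg, temperature=0.07):
    """InfoNCE loss for a batch of query embeddings.

    query:    (B, D) embeddings of sequences in one skeleton representation
    keys_pos: (B, D) embeddings of the same sequences (row-aligned positives)
    keys_neg: (K, D) embeddings of other sequences (negatives)
    The temperature value is an illustrative assumption.
    """
    # L2-normalise so dot products are cosine similarities.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    kp = keys_pos / np.linalg.norm(keys_pos, axis=1, keepdims=True)
    kn = keys_neg / np.linalg.norm(keys_neg, axis=1, keepdims=True)

    pos = np.sum(q * kp, axis=1, keepdims=True)  # (B, 1) positive similarities
    neg = q @ kn.T                               # (B, K) negative similarities
    logits = np.concatenate([pos, neg], axis=1) / temperature

    # Numerically stable log-softmax; the positive sits at column 0.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())


def inter_skeleton_loss(q_a, k_a, q_b, k_b, neg_a, neg_b, temperature=0.07):
    """Cross-contrast two representations, A and B, of the same sequences.

    Queries from A are matched against keys from B and vice versa, so the
    positive pair always spans two different input representations.
    """
    loss_a_to_b = info_nce(q_a, k_b, neg_b, temperature)
    loss_b_to_a = info_nce(q_b, k_a, neg_a, temperature)
    return loss_a_to_b + loss_b_to_a
```

In this sketch, `q_a` and `k_b` would come from two separate encoders, for example one consuming a sequence-style input and one consuming a graph-style input of the same clip. Minimizing the loss pulls those two embeddings together while pushing apart embeddings of other clips.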

Key Contributions

Inter-Skeleton Contrastive Learning:
- The main novelty lies in contrasting skeleton sequences instantiated in different representations (graph-based, sequence-based, and image-based) in a cross-contrastive learning framework. Because each positive pair spans two representations of the same sequence, the model learns invariant features that are less prone to the shortcuts often encountered when contrasting only augmented views.
Skeleton-Specific Augmentations:
- The authors develop several spatial and temporal augmentation techniques tailored for skeleton data, including pose augmentation, joint jittering, and temporal crop-resize. These augmentations allow the model to learn invariance to changes in viewpoint, noise in joint estimation, and variations in the temporal boundaries of an action sequence.
Comprehensive Evaluation:
- The proposed approach achieves state-of-the-art performance in self-supervised learning for action recognition on prominent datasets including NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD. It demonstrates significant improvements in 3D action recognition, retrieval, and semi-supervised learning tasks compared to existing methods.
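The three skeleton-specific augmentations named above can be sketched as follows. The summary does not give their exact definitions, so the concrete choices here are assumptions: pose augmentation is modeled as a random 3D shear, joint jittering as Gaussian noise on a random subset of joints, and temporal crop-resize as a random crop followed by linear interpolation to a fixed length. All function names, parameter names, and default values are hypothetical.

```python
import numpy as np


def pose_shear(seq, rng, strength=0.5):
    """Apply one random shear to every joint of a (T, J, 3) sequence.

    Modeling pose augmentation as a shear is an assumption of this sketch;
    it perturbs the apparent viewpoint of the skeleton.
    """
    s = rng.uniform(-strength, strength, size=6)
    shear = np.array([[1.0, s[0], s[1]],
                      [s[2], 1.0, s[3]],
                      [s[4], s[5], 1.0]])
    return seq @ shear.T


def joint_jitter(seq, rng, num_joints=5, sigma=0.02):
    """Add Gaussian noise to a random subset of joints (assumed noise model)."""
    out = seq.copy()
    joints = rng.choice(seq.shape[1], size=num_joints, replace=False)
    out[:, joints, :] += rng.normal(0.0, sigma, size=(seq.shape[0], num_joints, 3))
    return out


def temporal_crop_resize(seq, rng, out_len=64, min_ratio=0.5):
    """Crop a random temporal segment and resample it to out_len frames.

    Assumes the sequence has at least two frames.
    """
    num_frames = seq.shape[0]
    min_len = max(2, int(min_ratio * num_frames))
    crop_len = int(rng.integers(min_len, num_frames + 1))
    start = int(rng.integers(0, num_frames - crop_len + 1))
    crop = seq[start:start + crop_len]

    # Linear interpolation along the time axis.
    src = np.linspace(0.0, crop_len - 1, out_len)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, crop_len - 1)
    w = (src - lo)[:, None, None]
    return (1.0 - w) * crop[lo] + w * crop[hi]
```

Two augmented views of the same clip could then be produced by calling the three functions in sequence with a shared `np.random.default_rng()` generator, once per view, for example on an array of shape `(100, 25, 3)` for 100 frames of 25 joints.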

Numerical Results

The approach reports considerable accuracy gains across tasks. For 3D action recognition on the NTU RGB+D 60 dataset, it reaches higher top-1 accuracy than competing self-supervised methods, up to 85.2% in cross-view scenarios. In semi-supervised settings, it surpasses previous techniques by leveraging its self-supervised pre-training phase, particularly when only a small fraction of the training data is labeled.

Implications and Future Directions

The proposed inter-skeleton contrastive learning paradigm not only provides a robust framework for learning from unlabelled 3D skeleton data but also suggests broader applicability in domains requiring unsupervised feature learning under diverse representations. Future research could explore extending this framework to other types of data representations beyond skeleton-based action recognition, potentially benefiting recognition tasks that rely on multimodal inputs.

Moreover, the choice of specific skeleton augmentations and the benefits of contrasting diverse skeleton representations offer valuable insights for enhancing downstream task performance in other machine learning applications. Developers of future AI systems could integrate similar augmentation and cross-representation contrastive learning techniques to improve the generalizability and discrimination power of learned features in various domains.

Overall, the work advances self-supervised 3D action recognition and illustrates how contrastive methods can exploit multiple representations of the same data.
