STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention Transformer for Skeleton-based Action Recognition (2312.03288v1)
Abstract: Graph convolutional networks (GCNs) have been widely used and have achieved remarkable results in skeleton-based action recognition. We think the key to skeleton-based action recognition is a skeleton changing across frames, so we focus on how graph convolutional networks learn different topologies and effectively aggregate joint features over both global and local temporal ranges. In this work, we propose three Channel-wise Topology Graph Convolutions based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN). Combining CTR-GCN with two joint cross-attention modules captures skeleton features that describe the upper-lower body-part and hand-foot relationships. Then, to capture how human skeletons change across frames, we design Temporal Attention Transformers that learn the temporal features of human skeleton sequences. Finally, we fuse the output temporal features with an MLP and perform classification. The resulting graph convolutional network, named Spatial-Temporal Effective Body-part Cross Attention Transformer (STEP CATFormer), achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets. Our code and models are available at https://github.com/maclong01/STEP-CATFormer
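To make the idea of body-part cross attention concrete, the following is a minimal NumPy sketch of scaled dot-product cross attention in which upper-body joints attend to lower-body joints within a single frame. It is an illustration under stated assumptions, not the authors' implementation: the function names, the feature sizes, the random projection weights, and in particular the split of the 25 NTU RGB+D joints into `upper_idx` and `lower_idx` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Scaled dot-product cross attention.

    query_feats:   (Nq, C) features of the joints that ask the question
    context_feats: (Nc, C) features of the joints being attended to
    Returns the attended features (Nq, d) and the attention weights (Nq, Nc).
    """
    q = query_feats @ w_q      # (Nq, d)
    k = context_feats @ w_k    # (Nc, d)
    v = context_feats @ w_v    # (Nc, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
num_joints, channels, d = 25, 16, 8   # 25 joints per skeleton in NTU RGB+D
joints = rng.standard_normal((num_joints, channels))  # one frame of joint features

# Hypothetical partition: the real joint-to-body-part assignment may differ.
upper_idx = np.arange(0, 13)
lower_idx = np.arange(13, 25)

# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.standard_normal((channels, d)) for _ in range(3))

upper_out, attn = cross_attention(
    joints[upper_idx], joints[lower_idx], w_q, w_k, w_v
)
```

Each row of `attn` is a probability distribution over the lower-body joints, so `upper_out` mixes lower-body information into every upper-body joint. Swapping the two index sets gives the opposite direction, and the same function could be applied to a hand-foot partition.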