
A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition (2401.00409v1)

Published 31 Dec 2023 in cs.CV and cs.AI

Abstract: Human Interaction Recognition is the process of identifying interactive actions between multiple participants in a specific situation; the aim is to recognise the actions exchanged between multiple entities and their meaning. Many single Convolutional Neural Networks (CNNs) have issues, such as an inability to capture global inter-instance interaction features or difficulty in training, leading to ambiguity in action semantics. In addition, the computational complexity of the Transformer cannot be ignored, and its ability to capture local information and motion features is poor. In this work, we propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local specificity of the CNN and models global dependencies through the Transformer; the two streams jointly model the entity, temporal, and spatial relationships between interacting entities. Specifically, the Transformer-based stream integrates 3D convolutions with multi-head self-attention to learn inter-token correlations, while the CNN-based stream uses a new multi-branch framework that automatically learns joint spatio-temporal features from skeleton sequences. Its convolutional layers independently learn the local features of each joint's neighborhood and then aggregate the features of all joints, and the raw skeleton coordinates together with their temporal differences are combined in a dual-branch paradigm to fuse the motion features of the skeleton. A residual structure is added to speed up training convergence. Finally, the recognition results of the two streams are fused by parallel splicing. Experimental results on diverse and challenging datasets demonstrate that the proposed method better comprehends and infers the meaning and context of various actions, outperforming state-of-the-art methods.
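
The listing includes no code, so the following is a minimal PyTorch sketch of the two-stream design the abstract describes: a dual-branch CNN stream over raw joint coordinates and their frame-wise temporal differences (with a residual shortcut), a Transformer stream that embeds tokens with a 3D convolution before multi-head self-attention, and fusion by concatenation ("parallel splicing"). All module names, layer sizes, the class count, and the multi-person handling are illustrative assumptions, not the authors' implementation.

# Hypothetical THCT-Net-style sketch; shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn

class CNNStream(nn.Module):
    """Dual-branch CNN stream: raw joint coordinates + temporal differences."""
    def __init__(self, in_ch=3, feat=64, num_classes=26):
        super().__init__()
        # One small conv branch per modality; Conv2d treats the skeleton
        # sequence as a (channels, frames, joints) grid.
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_ch, feat, kernel_size=3, padding=1),
                nn.BatchNorm2d(feat),
                nn.ReLU(inplace=True),
                nn.Conv2d(feat, feat, kernel_size=3, padding=1),
                nn.BatchNorm2d(feat),
            )
        self.pos_branch = branch()
        self.mot_branch = branch()
        self.res = nn.Conv2d(in_ch, feat, kernel_size=1)  # residual shortcut
        self.head = nn.Linear(2 * feat, num_classes)

    def forward(self, x):                                  # x: (B, C, T, V)
        motion = x[:, :, 1:] - x[:, :, :-1]                # temporal difference
        motion = nn.functional.pad(motion, (0, 0, 1, 0))   # restore T frames
        p = self.pos_branch(x) + self.res(x)               # residual eases convergence
        m = self.mot_branch(motion) + self.res(motion)
        f = torch.cat([p, m], dim=1).mean(dim=(2, 3))      # global average pool
        return self.head(f)

class TransformerStream(nn.Module):
    """3D-conv token embedding followed by multi-head self-attention."""
    def __init__(self, in_ch=3, dim=64, num_classes=26):
        super().__init__()
        # Conv3d over (persons, frames, joints) produces local tokens.
        self.embed = nn.Conv3d(in_ch, dim, kernel_size=3, padding=1)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                  # x: (B, C, M, T, V)
        tok = self.embed(x)                                # (B, dim, M, T, V)
        tok = tok.flatten(2).transpose(1, 2)               # (B, M*T*V, dim)
        out = self.encoder(tok).mean(dim=1)                # pool over tokens
        return self.head(out)

class THCTNetSketch(nn.Module):
    """Fuse the two streams by concatenation ('parallel splicing')."""
    def __init__(self, num_classes=26):
        super().__init__()
        self.cnn = CNNStream(num_classes=num_classes)
        self.trans = TransformerStream(num_classes=num_classes)
        self.fuse = nn.Linear(2 * num_classes, num_classes)

    def forward(self, x):                                  # x: (B, C, M, T, V)
        b, c, m, t, v = x.shape
        # Run the CNN stream per person, then average (an assumption).
        cnn_in = x.permute(0, 2, 1, 3, 4).reshape(b * m, c, t, v)
        cnn_out = self.cnn(cnn_in).view(b, m, -1).mean(dim=1)
        return self.fuse(torch.cat([cnn_out, self.trans(x)], dim=1))

if __name__ == "__main__":
    # Toy input: batch 2, xyz coords, 2 persons, 16 frames, 25 joints.
    x = torch.randn(2, 3, 2, 16, 25)
    print(THCTNetSketch()(x).shape)                        # torch.Size([2, 26])

Treating the skeleton sequence as a (channels, frames, joints) grid lets plain 2D convolutions learn each joint's local neighborhood, while flattening (person, frame, joint) positions into tokens gives the attention layers a global view across both participants.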

