HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding (2307.05721v1)

Published 9 Jul 2023 in cs.CV

Abstract: Understanding comprehensive assembly knowledge from videos is critical for futuristic ultra-intelligent industry. To enable technological breakthrough, we present HA-ViD - the first human assembly video dataset that features representative industrial assembly scenarios, natural procedural knowledge acquisition process, and consistent human-robot shared annotations. Specifically, HA-ViD captures diverse collaboration patterns of real-world assembly, natural human behaviors and learning progression during assembly, and granulate action annotations to subject, action verb, manipulated object, target object, and tool. We provide 3222 multi-view, multi-modality videos (each video contains one assembly task), 1.5M frames, 96K temporal labels and 2M spatial labels. We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking. Importantly, we analyze their performance for comprehending knowledge in assembly progress, process efficiency, task collaboration, skill parameters and human intention. Details of HA-ViD is available at: https://iai-hrc.github.io/ha-vid.
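
To make the annotation granularity described in the abstract concrete, the sketch below models one temporal action label as a (subject, action verb, manipulated object, target object, tool) tuple with frame bounds. This is a minimal illustration only; the field names, example values, and frame-based timing are assumptions for exposition, not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of one HA-ViD fine-grained action annotation.
# Field names and values are illustrative, not the dataset's actual schema.
@dataclass
class ActionAnnotation:
    subject: str                  # e.g. "left hand", "right hand"
    action_verb: str              # e.g. "insert", "screw"
    manipulated_object: str       # part being handled
    target_object: Optional[str]  # part the manipulated object acts on, if any
    tool: Optional[str]           # tool used, if any
    start_frame: int              # temporal label: segment start
    end_frame: int                # temporal label: segment end

# Example record for a single labeled segment in one assembly video.
example = ActionAnnotation(
    subject="right hand",
    action_verb="screw",
    manipulated_object="bolt",
    target_object="base plate",
    tool="screwdriver",
    start_frame=1200,
    end_frame=1335,
)
```

Representing each label as such a tuple is what lets the benchmarked tasks (action recognition, action segmentation, object detection, multi-object tracking) share one consistent human-robot annotation vocabulary.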


