FreeA: Human-object Interaction Detection using Free Annotation Labels (2403.01840v2)
Abstract: Recent human-object interaction (HOI) detection methods depend on extensively annotated image datasets, which require significant manpower. In this paper, we propose a novel self-adaptive, language-driven HOI detection method, termed FreeA. This method leverages the adaptability of a text-image model to generate latent HOI labels without manual annotation. Specifically, FreeA aligns image features of human-object pairs with HOI text templates and employs a knowledge-based masking technique to suppress improbable interactions. Furthermore, FreeA introduces an interaction correlation matching method that boosts the likelihood of actions related to a specified action, thereby refining the generated HOI labels. Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI competitors. Our proposal gains +\textbf{13.29} (\textbf{159\%$\uparrow$}) mAP and +\textbf{17.30} (\textbf{98\%$\uparrow$}) mAP over the newest "Weakly" supervised model, and +\textbf{7.19} (\textbf{28\%$\uparrow$}) mAP and +\textbf{14.69} (\textbf{34\%$\uparrow$}) mAP over the latest "Weakly+" supervised model, on the HICO-DET and V-COCO datasets respectively, and is more accurate in localizing and classifying interactive actions. The source code will be made public.
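The label-generation idea described above (matching human-object pair features against HOI text templates, then masking out improbable interactions) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings stand in for a CLIP-like text-image model, and the mask values are hypothetical.

```python
# Hedged sketch of the abstract's pipeline: score a human-object pair's image
# embedding against HOI text-template embeddings (CLIP-style cosine
# similarity), then apply a knowledge-based mask that zeroes out actions
# implausible for the detected object, yielding a pseudo-label distribution.
# Embeddings below are random stand-ins, NOT features from a real model.
import numpy as np

def hoi_pseudo_labels(img_emb, text_embs, action_mask):
    """Return a probability over HOI templates for one human-object pair.

    img_emb     : (d,)   image embedding of the cropped human-object pair
    text_embs   : (k, d) embeddings of k HOI text templates
    action_mask : (k,)   1 if the action is plausible for the object, else 0
    """
    img = img_emb / np.linalg.norm(img_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                                  # one score per template
    sims = np.where(action_mask > 0, sims, -np.inf)   # knowledge-based masking
    exp = np.exp(sims - sims[action_mask > 0].max())  # stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
img_emb = rng.normal(size=512)
text_embs = rng.normal(size=(5, 512))   # e.g. 5 candidate HOI templates
mask = np.array([1, 1, 0, 1, 0])        # "ride bicycle" plausible, "eat bicycle" not
probs = hoi_pseudo_labels(img_emb, text_embs, mask)
print(probs)
```

Masked actions receive exactly zero probability, so the resulting distribution only assigns pseudo-label mass to interactions the knowledge prior deems plausible.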