HICO-DET-SG and V-COCO-SG: New Data Splits for Evaluating the Systematic Generalization Performance of Human-Object Interaction Detection Models (2305.09948v5)
Abstract: Human-Object Interaction (HOI) detection is the task of localizing humans and objects in an image and predicting the interactions between human-object pairs. In real-world scenarios, HOI detection models need systematic generalization, i.e., generalization to novel combinations of objects and interactions, because the training data are expected to cover only a limited portion of all possible combinations. To evaluate the systematic generalization performance of HOI detection models, we created two new sets of HOI detection data splits named HICO-DET-SG and V-COCO-SG, based on the HICO-DET and V-COCO datasets, respectively. When evaluated on the new data splits, HOI detection models with various characteristics performed much more poorly than when evaluated on the original splits. This shows that systematic generalization is a challenging goal in HOI detection. By analyzing the evaluation results, we also gain insights for improving systematic generalization performance and identify four possible future research directions. We hope that our new data splits and the presented analysis will encourage further research on systematic generalization in HOI detection.
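To make the notion of "novel combinations of objects and interactions" concrete, the following is a minimal Python sketch of a split in which no (object, interaction) combination appears in both the training and test sets. It is an illustration under stated assumptions, not the procedure used to build HICO-DET-SG or V-COCO-SG: the function name `systematic_split`, the `test_fraction` parameter, and the flat `(sample_id, object, interaction)` record format are all hypothetical, and each sample is assumed to carry a single human-object pair.

```python
import random
from collections import Counter


def systematic_split(samples, test_fraction=0.3, seed=0):
    """Split samples so that train and test share no (object, interaction) combination.

    samples: list of (sample_id, object_label, interaction_label) tuples.
    This record format is a simplifying assumption for illustration only.
    """
    rng = random.Random(seed)

    # Unique combinations, sorted first so the shuffle is reproducible.
    combos = sorted({(obj, act) for _, obj, act in samples})
    rng.shuffle(combos)

    # Number of combinations still on the training side, per object and per interaction.
    obj_count = Counter(obj for obj, _ in combos)
    act_count = Counter(act for _, act in combos)

    target = int(len(combos) * test_fraction)
    held_out = set()
    for obj, act in combos:
        if len(held_out) >= target:
            break
        # Hold out a combination only if its object and its interaction each
        # remain in at least one other training combination.
        if obj_count[obj] > 1 and act_count[act] > 1:
            held_out.add((obj, act))
            obj_count[obj] -= 1
            act_count[act] -= 1

    train = [s for s in samples if (s[1], s[2]) not in held_out]
    test = [s for s in samples if (s[1], s[2]) in held_out]
    return train, test


if __name__ == "__main__":
    objects = ["horse", "bicycle", "cup"]
    interactions = ["hold", "wash", "carry"]
    # Toy data: two samples for every object-interaction combination.
    toy = [
        (f"{o}-{a}-{i}", o, a)
        for o in objects
        for a in interactions
        for i in range(2)
    ]
    train, test = systematic_split(toy)
    print(len(train), "train samples,", len(test), "test samples")
```

The sketch keeps every object and every interaction visible during training while withholding some of their pairings, so a model evaluated on the test side has to recombine known parts. Real HOI images usually contain several human-object pairs, so an image-level split would need extra handling that this sketch leaves out.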