Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection (2404.06194v2)
Abstract: Open-vocabulary human-object interaction (HOI) detection, the task of detecting novel HOIs guided by natural language, is crucial for understanding human-centric scenes. However, prior zero-shot HOI detectors often use the same level of feature maps to model HOIs regardless of distance, leading to suboptimal performance in scenes containing human-object pairs with a wide range of distances. In addition, these detectors primarily rely on category names and overlook the rich contextual information that language can provide, which is essential for capturing open-vocabulary concepts that are typically rare and not well represented by category names alone. In this paper, we introduce a novel end-to-end open-vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement (CMD-SE), harnessing the potential of Visual-Language Models (VLMs). Specifically, we propose to model human-object pairs at different distances with different levels of feature maps by incorporating a soft constraint into the bipartite matching process. Furthermore, by leveraging LLMs such as GPT models, we exploit their extensive world knowledge to generate descriptions of human body part states for various interactions. We then integrate the generalizable, fine-grained semantics of human body parts to improve interaction recognition. Experimental results on two datasets, SWIG-HOI and HICO-DET, demonstrate that our proposed method achieves state-of-the-art results in open-vocabulary HOI detection. The code and models are available at https://github.com/ltttpku/CMD-SE-release.
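The soft constraint described above can be illustrated with a minimal sketch: a level-mismatch penalty is added to the standard Hungarian-matching cost so that distant (large-scale) human-object pairs tend to match predictions decoded from coarse feature levels, and close pairs match fine levels. Note this is a hypothetical illustration, not the paper's implementation; the function name `match_with_level_prior`, the linear preferred-level mapping, and the weight `alpha` are all assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_with_level_prior(cost, pred_levels, gt_sizes, alpha=1.0):
    """Bipartite matching with a soft level-assignment constraint.

    cost        -- [num_preds, num_gts] base matching cost (class + box terms)
    pred_levels -- level index of the feature map each prediction was decoded
                   from (0 = coarsest)
    gt_sizes    -- normalized spatial extent of each ground-truth human-object
                   pair in [0, 1] (a proxy for human-object distance)
    alpha       -- weight of the soft constraint (hypothetical parameter)
    """
    num_levels = pred_levels.max() + 1
    # Preferred level per GT pair: large/distant pairs -> coarse levels.
    preferred = ((1.0 - gt_sizes) * (num_levels - 1)).round()
    # Soft penalty grows with the level mismatch; it biases but does not
    # hard-restrict the assignment, unlike a per-level hard split.
    penalty = np.abs(pred_levels[:, None] - preferred[None, :])
    return linear_sum_assignment(cost + alpha * penalty)
```

With a tie-free base cost, the penalty steers each ground-truth pair toward a prediction from its preferred level while still letting a much better base cost override the prior, which is the point of keeping the constraint soft.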