Actor-agnostic Multi-label Action Recognition with Multi-modal Query (2307.10763v3)
Abstract: Existing action recognition methods are typically actor-specific due to the intrinsic topological and appearance differences among actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, such methods often learn from the visual modality alone under a single-label classification setup, neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach, 'actor-agnostic multi-modal multi-label action recognition', which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model within a transformer-based object detection framework (e.g., DETR), which leverages both visual and textual modalities to better represent the action classes. Eliminating actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that MSQNet consistently outperforms the prior art of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is available at https://github.com/mondalanindya/MSQNet.
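The core mechanism the abstract describes — DETR-style queries built from class-name text embeddings fused with video features, decoded into per-class multi-label logits — can be sketched compactly. The following PyTorch snippet is a minimal illustration under assumed dimensions and module names (`MultiModalSemanticQueryDecoder`, the mean-pooled fusion, and the learnable text embeddings are all simplifications, not the paper's exact design; see the linked repository for the real implementation):

```python
import torch
import torch.nn as nn

class MultiModalSemanticQueryDecoder(nn.Module):
    """Minimal sketch of a DETR-style decoder whose queries fuse
    class-name text embeddings with a pooled video feature.
    Module names and sizes here are illustrative, not the paper's."""

    def __init__(self, num_classes: int, dim: int = 512, num_layers: int = 4):
        super().__init__()
        # Learnable stand-in for class-name text embeddings; in practice
        # these could come from a frozen text encoder (e.g., CLIP) run
        # over the class names.
        self.text_embed = nn.Parameter(torch.randn(num_classes, dim))
        # Fuse each text embedding with a pooled video feature into one query.
        self.fuse = nn.Linear(2 * dim, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.cls_head = nn.Linear(dim, 1)  # one logit per class query

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T, dim) spatio-temporal tokens from a video encoder
        B = video_tokens.size(0)
        pooled = video_tokens.mean(dim=1, keepdim=True)             # (B, 1, dim)
        text = self.text_embed.unsqueeze(0).expand(B, -1, -1)       # (B, C, dim)
        queries = self.fuse(torch.cat(
            [text, pooled.expand(-1, text.size(1), -1)], dim=-1))   # (B, C, dim)
        # Queries cross-attend to the video tokens, as in DETR's decoder.
        decoded = self.decoder(tgt=queries, memory=video_tokens)    # (B, C, dim)
        return self.cls_head(decoded).squeeze(-1)                   # (B, C) logits

# Usage with illustrative sizes: 2 clips, 196 tokens, 512-dim features.
logits = MultiModalSemanticQueryDecoder(num_classes=140)(torch.randn(2, 196, 512))
probs = logits.sigmoid()  # independent per-class probabilities
```

Because each class has its own query and the head emits independent sigmoid logits (trained with a binary cross-entropy loss), concurrent actions can all score highly at once, unlike a softmax classifier that forces classes to compete.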
References
- VQA: Visual question answering. In ICCV, 2015.
- ViViT: A video vision transformer. In ICCV, 2021.
- A CLIP-hitchhiker's guide to long video retrieval. arXiv preprint arXiv:2205.08508, 2022.
- Bridging the gap between object and image-level representations for open-vocabulary detection. In NeurIPS, 2022.
- Monitoring animal behavior in the smart vivarium. In Measuring Behavior, Wageningen, The Netherlands, 2005.
- Is space-time attention all you need for video understanding? In ICML, 2021.
- Marc-André Carbonneau. Multiple instance learning under real-world conditions. PhD thesis, École de Technologie Supérieure, 2017.
- End-to-end object detection with transformers. In ECCV, 2020.
- Emerging properties in self-supervised vision transformers. In ICCV, 2021.
- Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
- Decoupling zero-shot semantic segmentation. In CVPR, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- PySlowFast. https://github.com/facebookresearch/slowfast, 2020.
- Multiscale vision transformers. In ICCV, 2021.
- MS-TCN: Multi-stage temporal convolutional network for action segmentation. In CVPR, 2019.
- Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
- CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
- Actor-transformers for group activity recognition. In CVPR, 2020.
- Video action transformer network. In CVPR, 2019.
- Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
- The THUMOS challenge on action recognition for videos “in the wild”. CVIU, 2017.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- Late temporal modeling in 3D CNN architectures with BERT for action recognition. In ECCV, 2020.
- The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- Adam: A method for stochastic optimization. In ICLR, 2015.
- Human action recognition and prediction: A survey. IJCV, 2022.
- HMDB: A large video database for human motion recognition. In ICCV, 2011.
- Temporal convolutional networks for action segmentation and detection. In CVPR, 2017.
- Segmental spatiotemporal CNNs for fine-grained action segmentation. In ECCV, 2016.
- Language-driven semantic segmentation. In ICLR, 2022.
- MS-TCN++: Multi-stage temporal convolutional network for action segmentation. IEEE TPAMI, 2020.
- Action recognition based on multimode fusion for VR online platform. VR, 2023.
- BMN: Boundary-matching network for temporal action proposal generation. In ICCV, 2019.
- DAB-DETR: Dynamic anchor boxes are better queries for DETR. In ICLR, 2022.
- Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- Video swin transformer. In CVPR, 2022.
- Zero-shot temporal action detection via vision-language prompting. In ECCV, 2022.
- Animal kingdom: A large and diverse dataset for animal behavior understanding. In CVPR, 2022.
- Expanding language-image pretrained models for general video recognition. In ECCV, 2022.
- TorchMetrics: Measuring reproducibility in PyTorch, 2022.
- ST-Adapter: Parameter-efficient image-to-video transfer learning. In NeurIPS, 2022.
- Multimodal open-vocabulary video classification via pre-trained vision and language models. arXiv preprint arXiv:2207.07646, 2022.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Fine-tuned CLIP models are efficient video learners. In CVPR, 2023.
- Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
- Multi-label class-imbalanced action recognition in hockey videos via 3D convolutional neural networks. In SNPD, 2018.
- Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
- A relationship between the average precision and the area under the roc curve. In ICTIR, 2015.
- VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In NeurIPS, 2022.
- Training data-efficient image transformers & distillation through attention. In ICML, 2021.
- Going deeper with image transformers. In ICCV, 2021.
- Video classification with channel-separated convolutional networks. In ICCV, 2019.
- Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.
- Attention is all you need. In NeurIPS, 2017.
- VideoMAE V2: Scaling video masked autoencoders with dual masking. In CVPR, 2023.
- ActionCLIP: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021.
- Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
- Towards data-efficient detection transformers. In ECCV, 2022.
- Non-local neural networks. In CVPR, 2018.
- Spatiotemporal pyramid network for video action recognition. In CVPR, 2017.
- CAMP: Cross-modal adaptive message passing for text-image retrieval. In ICCV, 2019.
- Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In CVPR, 2023.
- Two-stream region convolutional 3D network for temporal activity detection. IEEE TPAMI, 2019.
- Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
- Multiview transformers for video recognition. In CVPR, 2022.
- VideoCoCa: Video-text modeling with zero-shot transfer from contrastive captioners. arXiv preprint arXiv:2212.04979, 2022.
- Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling. In ECCV, 2022.
- Multi-label activity recognition using activity-specific features and activity correlations. In CVPR, 2021.
- Temporal action detection with structured segment networks. In ICCV, 2017.
- Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
- Extract free dense labels from CLIP. In ECCV, 2022.
- Learning to prompt for vision-language models. IJCV, 2022.
- Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.
- Deep-learning-enhanced human activity recognition for internet of healthcare things. IEEE IoT, 2020.
- Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.