Actor-agnostic Multi-label Action Recognition with Multi-modal Query (2307.10763v3)
Abstract: Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model designs and high maintenance costs. Moreover, these methods often focus on the visual modality alone and on single-label classification, neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model within a transformer-based object detection framework (e.g., DETR), which leverages both visual and textual modalities to better represent the action classes. A key advantage is the elimination of actor-specific model designs, which removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that MSQNet consistently outperforms prior actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is available at https://github.com/mondalanindya/MSQNet.
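The abstract describes MSQNet at a high level: action classes are represented as DETR-style queries built from both a visual embedding and a textual (class-name) embedding, and a transformer decoder attends these queries over video features to produce one logit per class for multi-label prediction. The sketch below is a minimal PyTorch illustration of that idea only; the module names, dimensions, and the concatenate-and-project fusion are assumptions for exposition, not the authors' released implementation (see the linked repository for that).

```python
# Hedged sketch of a multi-modal query decoder in the spirit of MSQNet.
# All names, shapes, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class MultiModalQueryDecoder(nn.Module):
    def __init__(self, num_classes: int, dim: int = 512, num_layers: int = 4):
        super().__init__()
        # Learnable visual part of each class query; the textual part is
        # supplied at forward time (e.g., CLIP embeddings of class names).
        self.visual_query = nn.Parameter(torch.randn(num_classes, dim))
        self.fuse = nn.Linear(2 * dim, dim)  # fuse visual + textual query parts
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, 1)  # one logit per class query

    def forward(self, video_tokens: torch.Tensor, text_emb: torch.Tensor):
        # video_tokens: (B, T, dim) spatio-temporal features from a video encoder
        # text_emb:     (C, dim)    class-name embeddings from a text encoder
        b = video_tokens.size(0)
        queries = self.fuse(torch.cat([self.visual_query, text_emb], dim=-1))
        queries = queries.unsqueeze(0).repeat(b, 1, 1)            # (B, C, dim)
        decoded = self.decoder(tgt=queries, memory=video_tokens)  # (B, C, dim)
        return self.classifier(decoded).squeeze(-1)               # (B, C) logits

# Usage: e.g., 140 action classes, a batch of 2 clips with 196 video tokens.
model = MultiModalQueryDecoder(num_classes=140)
logits = model(torch.randn(2, 196, 512), torch.randn(140, 512))
probs = logits.sigmoid()  # independent per-class probabilities
```

Because each action class owns its own query, a sigmoid over the per-query logits yields independent class probabilities, which is what makes the multi-label setting natural in this query-based formulation.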
Authors: Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan Dutta