
Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos (2402.08875v4)

Published 14 Feb 2024 in cs.CV

Abstract: The increasing variety and quantity of tagged multimedia content across online platforms offers a unique opportunity to advance the field of human action recognition. In this study, we utilize 283,582 unique, unlabeled TikTok video clips, categorized into 386 hashtags, to train a domain-specific foundation model for action recognition. We employ VideoMAE V2, an advanced model integrating Masked Autoencoders (MAE) with Vision Transformers (ViT), pre-trained on this diverse collection of unstructured videos. Our model, fine-tuned on established action recognition benchmarks such as UCF101 and HMDB51, achieves state-of-the-art results: 99.05% on UCF101, 86.08% on HMDB51, 85.51% on Kinetics-400, and 74.27% on Something-Something V2 using the ViT-giant backbone. These results highlight the potential of using unstructured and unlabeled videos as a valuable source of diverse and dynamic content for training foundation models. Our investigation confirms that while initial increases in pre-training data volume significantly enhance model performance, the gains diminish as the dataset size continues to expand. Our findings emphasize two critical axioms in self-supervised learning for computer vision: (1) additional pre-training data can yield diminishing benefits for some datasets and (2) quality is more important than quantity in self-supervised learning, especially when building foundation models.
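
To make the pre-training recipe the abstract describes more concrete, here is a minimal sketch of MAE-style masked autoencoding over video tubelet tokens. Everything below is a hypothetical simplification in PyTorch, not the paper's VideoMAE V2 implementation: the module sizes, the shared positional embedding, and the plain random masking are assumptions chosen for brevity (VideoMAE V2 itself uses a dual-masking scheme that also masks the decoder).

```python
# Minimal sketch of MAE-style video pre-training: tubelet-patchify a clip,
# mask ~90% of tokens, encode only the visible tokens, then reconstruct the
# pixels of the masked tokens. All names and sizes here are illustrative.
import torch
import torch.nn as nn

class TinyVideoMAE(nn.Module):
    def __init__(self, img=224, patch=16, frames=16, tubelet=2,
                 dim=384, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.num_tokens = (frames // tubelet) * (img // patch) ** 2
        self.patch_dim = 3 * tubelet * patch * patch
        # Tubelet embedding: each token spans `tubelet` frames x patch x patch pixels.
        self.embed = nn.Conv3d(3, dim, kernel_size=(tubelet, patch, patch),
                               stride=(tubelet, patch, patch))
        self.pos = nn.Parameter(torch.zeros(1, self.num_tokens, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, self.patch_dim)  # regress raw pixels per token

    def forward(self, video):                        # video: (B, 3, T, H, W)
        B = video.shape[0]
        tokens = self.embed(video).flatten(2).transpose(1, 2) + self.pos  # (B, N, D)
        N = tokens.shape[1]
        n_keep = int(N * (1 - self.mask_ratio))
        # Per-sample random masking: keep the first n_keep of a random permutation.
        ids_shuffle = torch.rand(B, N, device=video.device).argsort(dim=1)
        ids_keep = ids_shuffle[:, :n_keep]
        visible = torch.gather(tokens, 1,
                               ids_keep.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
        encoded = self.encoder(visible)              # encoder sees ~10% of tokens
        # Scatter encoded tokens back; fill masked slots with a learned mask token.
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(1, ids_keep.unsqueeze(-1).expand(-1, -1, full.shape[-1]), encoded)
        pred = self.head(self.decoder(full + self.pos))  # (B, N, patch_dim)
        return pred, ids_shuffle[:, n_keep:]         # predictions + masked token ids

def patchify(video, patch=16, tubelet=2):
    # (B, 3, T, H, W) -> (B, N, 3*tubelet*patch*patch), matching the token
    # order produced by the Conv3d tubelet embedding above.
    B, C, T, H, W = video.shape
    x = video.reshape(B, C, T // tubelet, tubelet,
                      H // patch, patch, W // patch, patch)
    x = x.permute(0, 2, 4, 6, 3, 5, 7, 1)            # B, t, h, w, tub, ph, pw, C
    return x.reshape(B, -1, tubelet * patch * patch * C)

def mae_loss(pred, target, masked_ids):
    # MSE on the masked tokens only, as in MAE-style objectives.
    idx = masked_ids.unsqueeze(-1).expand(-1, -1, pred.shape[-1])
    return ((torch.gather(pred, 1, idx) - torch.gather(target, 1, idx)) ** 2).mean()

# Usage with a small hypothetical configuration:
model = TinyVideoMAE(img=112, frames=8)
clip = torch.randn(2, 3, 8, 112, 112)                # two random clips
pred, masked_ids = model(clip)
loss = mae_loss(pred, patchify(clip), masked_ids)
loss.backward()
```

The very high mask ratio is what makes this style of pre-training economical: the encoder only ever processes a small fraction of the tokens, which is how such objectives scale to ViT-giant backbones like the one used in this paper.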

References (10)
  1. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. ICLR camera-ready version.
  2. The smallest sample size for the desired diagnosis accuracy. Int J Oncol Cancer Ther, 2:13–19, 2017.
  3. The "something something" video database for learning and evaluating visual common sense. arXiv preprint arXiv:1706.04261, 2017.
  4. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  5. HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563, 2011.
  6. Real-time flying object detection with YOLOv8. arXiv preprint arXiv:2305.09972, 2023.
  7. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. CRCV-TR-12-01.
  8. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.
  9. Videomae v2: Scaling video masked autoencoders with dual masking. arXiv preprint arXiv:2303.16727, 2023. CVPR 2023 camera-ready version.
  10. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
