Motion Guided Token Compression for Efficient Masked Video Modeling (2402.18577v1)

Published 10 Jan 2024 in cs.CV and cs.AI

Abstract: Recent developments in Transformers have achieved notable strides in enhancing video comprehension. Nonetheless, the $O(N^2)$ computational complexity of attention mechanisms presents substantial hurdles when dealing with the high dimensionality of videos. The challenge becomes particularly pronounced when increasing the frames per second (FPS) to improve motion-capturing capability, since doing so introduces redundancy and exacerbates the existing computational limitations. In this paper, we begin by showing the performance gains achieved by raising the FPS rate. We then present a novel approach, Motion Guided Token Compression (MGTC), which enables Transformer models to represent a video with a smaller yet more representative set of tokens, yielding substantial reductions in computational cost while remaining compatible with higher FPS rates. Specifically, drawing inspiration from video compression algorithms, we measure the variation between patches in consecutive video frames along the temporal dimension; tokens whose variation falls below a predetermined threshold are masked. This masking strategy removes video redundancy while preserving essential information. Our experiments on the widely studied video recognition datasets Kinetics-400, UCF101 and HMDB51 demonstrate that raising the FPS rate yields top-1 accuracy improvements of over 1.6, 1.6 and 4.0, respectively. By applying MGTC with a masking ratio of 25%, we further improve accuracy by 0.1 while reducing computational cost by over 31% on Kinetics-400. Even under a fixed computational budget, higher FPS rates paired with MGTC sustain performance gains compared with lower-FPS settings.
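
The abstract describes the MGTC masking rule only at a high level. Below is a minimal PyTorch sketch of that idea; the function name `mgtc_mask`, the use of a mean absolute patch difference as the motion signal, and deriving the threshold from a quantile so that it matches a target masking ratio are assumptions of this sketch, not details confirmed by the paper.

```python
import torch

def mgtc_mask(video: torch.Tensor, patch_size: int = 16,
              mask_ratio: float = 0.25) -> torch.Tensor:
    """Return a boolean keep-mask over patch tokens for one video clip.

    video: (T, C, H, W) float tensor of frames; H and W must be
    divisible by patch_size. True marks tokens to keep (high
    inter-frame change), False marks tokens judged redundant.
    """
    T, C, H, W = video.shape
    p = patch_size
    # Split every frame into non-overlapping p x p patches -> (T, N, C*p*p).
    patches = (
        video.unfold(2, p, p)              # (T, C, H/p, W, p)
             .unfold(3, p, p)              # (T, C, H/p, W/p, p, p)
             .permute(0, 2, 3, 1, 4, 5)    # (T, H/p, W/p, C, p, p)
             .reshape(T, -1, C * p * p)
    )
    # Mean absolute difference of each patch against the same patch in
    # the previous frame: a cheap proxy for local motion.
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=-1)   # (T-1, N)
    # The first frame has no predecessor, so all of its tokens are kept.
    diff = torch.cat([torch.full_like(diff[:1], float("inf")), diff])
    # Pick the threshold so that `mask_ratio` of the remaining tokens
    # fall below it (25% masking in the paper's main experiment).
    threshold = torch.quantile(diff[1:].flatten(), mask_ratio)
    return diff >= threshold
```

In a masked-video-modeling pipeline, the `False` tokens would simply be dropped before the Transformer encoder; since attention cost grows quadratically with token count, that is where a saving such as the reported 31% at a 25% masking ratio would come from.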

Authors (4)
  1. Yukun Feng (7 papers)
  2. Yangming Shi (7 papers)
  3. Fengze Liu (18 papers)
  4. Tan Yan (6 papers)
