Collaboratively Self-supervised Video Representation Learning for Action Recognition (2401.07584v1)
Abstract: Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition by jointly considering generative pose prediction and discriminative context matching as pretext tasks. Specifically, CSVR consists of three branches: a generative pose prediction branch, a discriminative context matching branch, and a video generation branch. The first branch encodes dynamic motion features by using a conditional GAN to predict the human poses of future frames; the second extracts static context features by pulling together the representations of clips and compressed key frames from the same video while pushing apart pairs from different videos; and the third is designed to recover the current video frames and predict future ones, so as to collaboratively improve the dynamic motion and static context features. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
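To make the three-branch layout and the joint objective concrete, below is a minimal PyTorch sketch, not the paper's implementation: all sub-modules (`clip_encoder`, `frame_encoder`, `pose_generator`, `video_decoder`) are placeholder layers standing in for the actual backbones, an InfoNCE loss stands in for the discriminative context-matching objective, and simple reconstruction losses stand in for the conditional-GAN pose-prediction and video-generation losses. Shapes, joint counts, and loss weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(q, k, temperature=0.1):
    """InfoNCE: clip/key-frame pairs from the same video are positives;
    all other pairs in the batch act as negatives."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                 # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)


class CSVRSketch(nn.Module):
    """Toy three-branch layout: pose prediction, context matching, video generation.
    Every module here is a stand-in, not the network used in the paper."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.clip_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))   # stands in for a 3D CNN clip backbone
        self.frame_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))  # encodes the compressed key frame
        self.pose_generator = nn.Linear(feat_dim, 17 * 2)                          # "generator": predicts future 2D poses (17 joints)
        self.video_decoder = nn.Linear(2 * feat_dim, 3 * 16 * 16)                  # reconstructs/predicts tiny frames from both features

    def forward(self, clip, key_frame):
        motion = self.clip_encoder(clip)          # dynamic motion feature
        context = self.frame_encoder(key_frame)   # static context feature
        pose_pred = self.pose_generator(motion)
        frame_pred = self.video_decoder(torch.cat([motion, context], dim=1))
        return motion, context, pose_pred, frame_pred


# Joint self-supervised objective (targets and weights are placeholders).
model = CSVRSketch()
clip = torch.randn(8, 3, 8, 16, 16)        # (B, C, T, H, W) toy clip
key_frame = torch.randn(8, 3, 16, 16)      # one compressed key frame per video
motion, context, pose_pred, frame_pred = model(clip, key_frame)
pose_target = torch.randn(8, 17 * 2)       # future poses (would come from a pose estimator)
frame_target = torch.randn(8, 3 * 16 * 16) # current + future frames, flattened
loss = (F.mse_loss(pose_pred, pose_target)        # pose-prediction branch (adversarial in the paper)
        + info_nce(motion, context)               # context-matching branch
        + F.mse_loss(frame_pred, frame_target))   # video-generation branch
loss.backward()
```

The intent of the sketch is only to show how the motion and context features feed three losses that are optimized jointly; the actual architectures, adversarial training, and loss weighting follow the paper.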