
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning (2211.13929v5)

Published 25 Nov 2022 in cs.CV

Abstract: We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled videos. XKD is trained with two pseudo objectives. First, masked data reconstruction is performed to learn modality-specific representations from audio and visual streams. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through a teacher-student setup to learn complementary information. We introduce a novel domain alignment strategy to tackle domain discrepancy between audio and visual modalities, enabling effective cross-modal knowledge distillation. Additionally, to develop a general-purpose network capable of handling both audio and visual streams, modality-agnostic variants of XKD are introduced, which use the same pretrained backbone for different audio and visual tasks. Our proposed cross-modal knowledge distillation improves video action classification by $8\%$ to $14\%$ on UCF101, HMDB51, and Kinetics400. Additionally, XKD improves multimodal action classification by $5.5\%$ on Kinetics-Sound. XKD shows state-of-the-art performance in sound classification on ESC50, achieving top-1 accuracy of $96.5\%$.
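
The abstract describes two pseudo-objectives: per-modality masked reconstruction, followed by cross-modal teacher-student distillation with a domain-alignment term to bridge the audio-visual gap. Below is a minimal PyTorch-style sketch of that training recipe, not the authors' implementation: the `ModalityEncoder` class, the EMA teacher update, the linear-kernel MMD stand-in for domain alignment, the 0.75 mask ratio, and the loss weighting are all illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the authors' code) of the two pseudo-objectives described
# in the abstract: masked reconstruction per modality, plus cross-modal
# teacher-student distillation with a domain-alignment term.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Toy MLP encoder standing in for the real audio/visual backbones (assumption)."""
    def __init__(self, in_dim, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, x):
        return self.net(x)


def masked_reconstruction_loss(encoder, decoder, tokens, mask_ratio=0.75):
    """Objective 1: reconstruct randomly masked tokens from the visible ones."""
    b, n, _ = tokens.shape
    mask = torch.rand(b, n, device=tokens.device) < mask_ratio
    visible = tokens.masked_fill(mask.unsqueeze(-1), 0.0)   # crude masking stand-in
    recon = decoder(encoder(visible))
    return F.mse_loss(recon[mask], tokens[mask])


def mmd_loss(x, y):
    """Linear-kernel MMD as an illustrative domain-alignment term (assumption)."""
    return (x.mean(0) - y.mean(0)).pow(2).sum()


def distillation_step(student, teacher, student_tokens, teacher_tokens):
    """Objective 2: student features match (domain-aligned) teacher features."""
    with torch.no_grad():
        target = teacher(teacher_tokens).mean(dim=1)          # pooled teacher feature
    pred = student(student_tokens).mean(dim=1)                # pooled student feature
    align = mmd_loss(pred, target)                            # reduce domain discrepancy
    distill = F.mse_loss(F.normalize(pred, dim=-1),
                         F.normalize(target, dim=-1))
    return distill + 0.1 * align                              # weighting is a guess


@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights track the student by exponential moving average."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)


# Usage sketch with dummy token tensors (batch, tokens, dim).
audio_enc = ModalityEncoder(in_dim=128)
video_enc = ModalityEncoder(in_dim=768)
video_teacher = copy.deepcopy(video_enc)
for p in video_teacher.parameters():
    p.requires_grad_(False)

audio_tokens = torch.randn(4, 32, 128)   # stand-in for audio spectrogram patches
video_tokens = torch.randn(4, 32, 768)   # stand-in for video patches

recon = masked_reconstruction_loss(video_enc, nn.Linear(256, 768), video_tokens)
loss = recon + distillation_step(audio_enc, video_teacher, audio_tokens, video_tokens)
loss.backward()
ema_update(video_teacher, video_enc)
```

The sketch distills in a single direction (audio student from a video teacher) for brevity; per the abstract, distillation in XKD runs between the two modalities through a teacher-student setup, and the paper's domain alignment strategy, rather than this simple MMD term, is what makes that cross-modal transfer effective.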
