
SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning (2402.16830v1)

Published 26 Feb 2024 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Self-supervised learning (SSL) has achieved remarkable success across various speech-processing tasks. To enhance its efficiency, previous works often leverage compression techniques. A notable recent attempt is DPHuBERT, which applies joint knowledge distillation (KD) and structured pruning to learn a significantly smaller SSL model. In this paper, we contribute to this research domain by introducing SKILL, a novel method that conducts distillation across groups of layers instead of distilling individual, arbitrarily selected layers within the teacher network. The layers to distill are identified through a hierarchical clustering procedure applied to layer similarity measures. Extensive experiments demonstrate that our distilled version of WavLM Base+ not only outperforms DPHuBERT but also achieves state-of-the-art results in the 30M-parameter model class across several SUPERB tasks.
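The abstract describes the grouping step only at a high level, so the following is a minimal sketch of the general idea, assuming a linear-CKA similarity between teacher layer activations and average-linkage hierarchical clustering. The function names, the choice of CKA, the linkage method, and the group-mean distillation target are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the SKILL idea as stated in the abstract: group teacher layers by
# representation similarity (hierarchical clustering over a layer-similarity
# matrix, here linear CKA), then distill group-level targets.
import torch
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two layer activations of shape (frames, dim)."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    xty = x.T @ y
    num = (xty * xty).sum()                       # ||Y^T X||_F^2
    den = torch.norm(x.T @ x) * torch.norm(y.T @ y)
    return (num / den).item()


def group_layers(layer_outputs: list[torch.Tensor], n_groups: int) -> list[int]:
    """Cluster teacher layers into n_groups with average-linkage clustering
    on (1 - CKA) distances; returns one group id per layer."""
    n = len(layer_outputs)
    dist = torch.zeros(n, n)
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - linear_cka(layer_outputs[i], layer_outputs[j])
            dist[i, j] = dist[j, i] = d
    z = linkage(squareform(dist.numpy(), checks=False), method="average")
    return fcluster(z, t=n_groups, criterion="maxclust").tolist()


# Toy example: 12 teacher layers, 200 frames of 768-dim features each.
teacher_feats = [torch.randn(200, 768) for _ in range(12)]
groups = group_layers(teacher_feats, n_groups=4)
print(groups)  # e.g. [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4]

# A group-level distillation target could then be, for instance, the mean of
# each group's layer outputs, matched by the student with a regression loss
# (an assumption; the paper's exact loss is not given in the abstract).
```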
