XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words (2310.05235v1)
Abstract: Because the speech stream contains no explicit word boundaries, segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage recent self-supervised speech models, which have been shown to adapt quickly to new tasks through fine-tuning, even in low-resource conditions. Taking inspiration from semi-supervised learning, we fine-tune an XLS-R model to predict word boundaries that are themselves produced by top-tier speech segmentation systems: DPDP, VG-HuBERT, GradSeg and DP-Parse. Once XLS-R is fine-tuned, it is used to infer new word boundary labels, which are used in turn for another fine-tuning step. Our method consistently improves the performance of each system and sets a new state of the art that is, on average, 130% higher than the previous one, as measured by the F1 score on correctly discovered word tokens across five corpora featuring different languages. Finally, our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.
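To make the evaluation metric concrete, the following is a minimal sketch of a token-level F1 score, in which a word token counts as correctly discovered only when both of its boundaries match the reference and no extra boundary falls between them. The function names and the exact-match criterion on integer frame indices are assumptions made for illustration; the abstract does not specify the matching tolerance or the evaluation toolkit actually used.

```python
def boundaries_to_tokens(boundaries):
    """Turn boundary positions into a set of (start, end) word tokens.

    Assumption: boundaries are integer frame indices for one utterance.
    """
    b = sorted(set(boundaries))
    # Each pair of consecutive boundaries delimits one candidate word token.
    return set(zip(b[:-1], b[1:]))


def token_f1(predicted, gold):
    """Token F1: a token is a hit only if both of its edges match exactly."""
    pred_tokens = boundaries_to_tokens(predicted)
    gold_tokens = boundaries_to_tokens(gold)
    hits = len(pred_tokens & gold_tokens)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_tokens)
    recall = hits / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one misplaced boundary (30 instead of 25)
# breaks the two tokens that touch it, leaving one hit out of three.
gold = [0, 10, 25, 40]
pred = [0, 10, 30, 40]
score = token_f1(pred, gold)
```

Note that a single misplaced boundary costs two tokens, which is why token F1 is stricter than a boundary-level F1 computed on the same predictions.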
Authors: Robin Algayres, Pablo Diego-Simon, Emmanuel Dupoux, Benoit Sagot