A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors (2311.15954v1)
Abstract: In this work, we study the features extracted by English self-supervised learning (SSL) models in cross-lingual contexts and propose a new metric to predict the quality of feature representations. Using automatic speech recognition (ASR) as a downstream task, we analyze the effect of model size, training objectives, and model architecture on the models' performance as feature extractors for a set of typologically diverse corpora. We develop a novel metric, the Phonetic-Syntax Ratio (PSR), to measure the phonetic and syntactic information in the extracted representations using deep generalized canonical correlation analysis. Results show that the contrastive loss in the wav2vec2.0 objective facilitates more effective cross-lingual feature extraction. There is a positive correlation between PSR scores and ASR performance, suggesting that phonetic information extracted by monolingual SSL models can be used for downstream tasks in cross-lingual settings. The proposed metric is an effective indicator of the quality of the representations and can be useful for model selection.
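The paper computes PSR with deep generalized canonical correlation analysis; the exact formulation is not given in the abstract. As a rough, hypothetical sketch of the underlying idea, one can estimate how strongly a layer's representations correlate with phonetic versus syntactic target features using plain linear CCA and take the ratio of the two mean canonical correlations (all function names and the linear-CCA simplification here are assumptions, not the authors' implementation):

```python
import numpy as np

def cca_corr(X, Y):
    """Mean canonical correlation between two views (rows = frames/utterances)."""
    # Mean-center both views.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases of each view's column space (thin QR).
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # Canonical correlations are the singular values of Qx^T Qy.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(s, 0.0, 1.0).mean())

def phonetic_syntax_ratio(reprs, phonetic_feats, syntactic_feats):
    """PSR-style score: phonetic vs. syntactic correlation of the representations."""
    return cca_corr(reprs, phonetic_feats) / cca_corr(reprs, syntactic_feats)
```

A score above 1 would indicate that the representations carry relatively more phonetic than syntactic information, which is the regime the abstract links to better cross-lingual ASR performance.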